1. Introduction*


Twelve years ago the Bell Telephone Laboratories released for
publication their important work on the techniques of spectrographic
portrayal of speech.² One year later the book Visible Speech³ was
published marking a new era in experimental phonetics.


* The Royal Institute of Technology Stockholm, Sweden, Report No. 8,
June 11th 1957, The Speech Transmission Laboratory, G. Fant, ‘Modern
Instruments and Methods for Acoustic Studies of Speech’.

** At present research associate at the Speech Transmission Laboratory
of the Division of Telegraphy and Telephony, Royal Institute of
Technology, Stockholm.

*** Secretary of the VIII International Congress of Linguists.

* The main purpose of this report is to illustrate techniques for the
collection, processing, and interpretation of acoustic data on speech. The
visual display of the sound substance, as an essential of modern analysis,
is not only an important basis for establishing acoustic correlates to linguistic
categories, but it also provides a useful substitute for direct observations
of just how the speech has been produced. A considerable emphasis has thus
been laid on the physiological interpretation of acoustic data.

A comprehensive survey of the present status of acoustic phonetics in
relation to linguistics has been undertaken in ref. 1.

1 Eli Fischer-Jørgensen, ‘What Can the New Techniques of Acoustic
Phonetics Contribute to Linguistics?’ Proc. of VIII Int. Congress of Linguists,
Oslo 1958.

Expectations ran high. It was apparent that here was a power- 
ful method of performing objective studies of the essentials of the 
sound-substance as contained in the speech wave. A few years later 
the instrument was made commercially available by the Kay 
Electric Company under the name of ‘Sonagraph’. The research 
could begin. 

The start was fairly promising. The basis of modern acoustic
phonetics was explained by M. Joos,⁴ who was the first linguist
to make use of the sound spectrograph. His book is a valuable
introduction to the physics of speech, the theory of frequency
analysis, and the theory of speech perception, but it does not go
very far into the general applications of spectrographic techniques
for studies of speech and language. The spectrographic patterns
for vowels were extensively treated by Potter and Steinberg,⁵
Peterson,⁶ Peterson and Barney,⁷ and R. L. Miller,⁸ all connected with
the Bell Telephone Laboratories at that time. These investigations
were designed to determine the physical correlates of vowel quality, 
the statistical spread’ of data from large groups of speakers, and 
methods of ‘normalizing’® the data for a single speaker to extract 
the ‘information bearing elements’.⁹ Often, identity of vowel colour
as an adjunct to phonemic distinction was utilized as the listeners’
response criterion.


2 Bell Telephone Laboratories, ‘Technical Aspects of Visible Speech’, 
Bell Telephone System Monograph B-1415, 1946, and J. Acoust. Soc. Am. 
17, 1946, 1—89. (The public disclosure was one year earlier: R. K. Potter, 
‘Visible Patterns of Sound’, Science, Nov 9, 1945). 

3 R. K. Potter, A.G. Kopp, H.C. Green, Visible Speech, New York 1947. 

4 M. Joos, ‘Acoustic Phonetics’, Language 24, 1948, 1—136. 

5 R. K. Potter, J. C. Steinberg, ‘Toward the Specification of Speech’, 
J. Acoust. Soc. Am. 22, 1950, 807—820. 

6 G. E. Peterson, ‘The Phonetic Value of Vowels’, Language 27, 1951, 
541—553. 

7 G. E. Peterson, H. L. Barney, ‘Control Methods Used in a Study of 
the Vowels’, J. Acoust. Soc. Am. 24, 1952, 175—184. 

8 R. L. Miller, ‘Auditory Tests with Synthetic Vowels’, J. Acoust. Soc. 
Am. 25, 1953, 114—121. 

9 G. E. Peterson, ‘The Information Bearing Elements of Speech’, J.
Acoust. Soc. Am. 24, 1952, 629—637. 
There exists an extensive literature¹⁰⁻²⁵ on acoustic analysis
of speech performed with simpler means and before the introduction
of the modern sound spectrograph. The new methods have, however,
provided much more complete data.

Not much work on consonants was done except for the basic
presentation in the book Visible Speech.³ This suffers, however,
from the limitation of the frequency range of analysis to an upper
limit of 3400 c/s, corresponding to common telephone practice. An
upper limit of 8000 c/s is needed for an unambiguous comparative
description of unvoiced continuants and stops.

One of the most important events in experimental phonetics in 
the last ten years is the research into spectrographic speech patterns 
by means of synthetic speech. The Haskins Laboratories have made 


10 H. Pipping, Om klangfärgen hos sjungna vokaler, Helsinki 1890.

11 R. Paget, ‘The Production of Artificial Vowel Sounds’, Proc. Roy.
Soc. A 102, 1923, 75, and Human Speech, London 1930.

12 I. B. Crandall, ‘Sounds of Speech’, Bell System Techn. J. 4, 1925,
586–626.

13 C. Stumpf, Die Sprachlaute, Berlin 1926.

14 H. Fletcher, Speech and Hearing, New York 1929 and 1953.

15 J. C. Steinberg, ‘Application of Sound Measuring Instruments to
the Study of Phonetic Problems’, J. Acoust. Soc. Am. 6, 1934, 16–24.

16 F. Trendelenburg, Klänge und Geräusche, Berlin 1935, and Einführung
in die Akustik, Zweite Auflage, Berlin 1950, 138–150, 359–362.

17 L. Barczinski, E. Thienhaus, ‘Klangspektren und Lautstärke deutscher
Sprachlaute’, Arch. Néerland. Phon. Exp. 11, 1935, 47–68.

18 Don Lewis, ‘Vocal Resonance’, J. Acoust. Soc. Am. 8, 1936, 91.

19 A. Sovijärvi, ‘Die wechselnden und festen Formanten der Vokale
erklärt durch Spektrogramme und Röntgengramme der finnischen Vokale’,
Proc. III Int. Phonet. Cong., 1938, 407–420.

20 C. Chiba, M. Kajiyama, The Vowel, Its Nature and Structure, Tokyo
1941.

21 S. Smith, ‘Analysis of Vowel Sounds by Ear’, Arch. Néerland. Phon.
Exp. XX, 1947, 78–96.

22 T. Tarnóczy, ‘Resonance Data Concerning Nasals, Laterals, and
Trills’, Word 4, 1948, 71–77.

23 C. G. M. Fant, ‘Analys av de svenska vokalljuden’, L. M. Ericsson 
protokoll H/P 1035, 1948. 

24 C. G. M. Fant, ‘Analys av de svenska konsonantljuden’, L. M. Erics- 
son protokoll H/P 1064, 1949. 

25 C. G. M. Fant, ‘Discussion of paper read by G. E. Peterson at the 
1952 Symposium on the Applications of Communication Theory’. Publ. in 
Communication Theory, ed. W. Jackson, London 1953, 421—424. 
extensive studies²⁶,²⁷ of the relevance of various pattern aspects
by a technique of spectrographic playback by which handpainted
stylized spectrograms are converted to speech. This project, aiming
at the evaluation of spectral characteristics, has been performed with
such an unbiased attitude towards the actual composition of speech
that some spectrographic pattern aspects can be said to have been
rediscovered through synthesis. At present the wealth of empirical
knowledge gained by this research on stylized spectrograms
constitutes a very essential conceptual reference for the language of
visible speech as conceived by phoneticians.

But what happened with the spectrograph? Is it inadequate
for collecting data on consonants? It might have been expected that
several complete language and dialect studies on a spectrographic
basis would have been completed since the spectrograph was
introduced. This is a rather complex point to discuss. First of all, the
Sonagraph has some technical imperfections, such as inadequate
portrayal of very weak sounds. The sectioning device for performing
detailed studies of narrowly time-limited intervals of the speech
wave is not well adapted to the study of sound intervals of short
duration. A special gating technique was therefore adopted by
Halle, Hughes, and Radley²⁸⁻³⁰ for the study of English and Russian
stops, affricates, and fricatives. An earlier survey of Swedish
consonants²⁴ was partially based on a bandpass oscillogram sampling
technique similar to the one described in section 3. Good qualitative
descriptions of stop sounds on the basis of Sonagraph techniques
can, however, be made, as shown by the extensive investigations
of Danish stop sounds by Eli Fischer-Jørgensen.³² An octave
band oscillographic technique has been used by Tarnóczy.³¹

26 F. S. Cooper et al., ‘Some Experiments on the Perception of Syn- 
thetic Speech Sounds’, J. Acoust. Soc. Am. 24, 1952, 596—606. 

27 P. Delattre, A. M. Liberman, F. S. Cooper, ‘Acoustic Loci and Tran- 
sitional Cues for Consonants’, J. Acoust. Soc. Am. 27, 1955, 769—773. 

28 M. Halle, ‘The Russian Consonants. A Phonemic and Acoustical
Investigation’, Dr. Phil. Thesis, Harvard University, Dec. 1954.

29 G. W. Hughes, M. Halle, ‘Spectral Properties of Fricative Consonants’,
J. Acoust. Soc. Am. 28, 1956, 303–310.

30 M. Halle, G. W. Hughes, J. P. A. Radley, ‘Acoustic Properties of
Stop Consonants’, J. Acoust. Soc. Am. 29, 1957, 107–116.

31 T. Tarnóczy, ‘Die akustische Struktur der stimmlosen Engelaute’,
Acta Linguistica 4, Budapest 1954, 313–349.

32 E. Fischer-Jørgensen, ‘Acoustic Analysis of Stop Consonants’,
Miscellanea Phonetica II, 1954, 42–59.
A Sonagraph record of an utterance provides the investigator
with a complex of interesting facts, some of which can be appreciated 
after a short learning period only. The difficulty is to translate 
the visual patterns into short concise statements as scientific 
records digestible to other researchers. The technical metalanguage 
does not seem to be very well developed yet. 

Actually there are not very many linguists who have a spectrograph,
and even fewer who also possess a sufficient familiarity
with the techniques of a systematic ordering of the spectrographic
data. Engineers and other nonlinguists who are more
accustomed to the handling of acoustic data will generally not
go very deeply into descriptive phonetics work.

One of the greatest problems in speech analysis is the mass of 
data to be dealt with. According to communication theory³³ an
investigator who is equipped with a spectrograph or oscillograph 
capable of recording signals up to an upper frequency limit of 
W c/s has to collect and pay attention to a number of W numerical 
quanta per second of the speech to be analysed. Since a frequency 
range of W = 8000 c/s is needed in order to avoid loss of information 
on the most high-pitched sounds, it is apparent that it is not 
possible to make maximal use of the recorded data. 

There remain three alternatives for the further utilization of 
the recorded data. One is to perform an approximation according 
to the tolerances set by human hearing. This level of specification 
is of theoretical interest only since these tolerances are not very 
well known for connected speech or a connected chain of speech-like
stimuli. We can, however, make use of some auditory criteria
for maximum estimates of the accuracy needed for a specification.³⁴⁻³⁶
The second level of specification is with reference to the symbols of 
a narrow phonetic transcription. This is also a rather complex 
undertaking if a detailed mapping is attempted, but it is of course 
advisable to start out with extensive references to this level before 


33 C. Cherry, On Human Communication, London 1956. 

34 J. L. Flanagan, ‘Estimates of the Maximum Precision Necessary in 
Quantizing Certain Dimensions of Vowel Sounds’, J. Acoust. Soc. Am. 29, 
1957, 533—534. 

35 J. L. Flanagan, ‘Difference Limen for Vowel Formant Frequency’, 
J. Acoust. Soc. Am. 27, 1955, 613—617. 

36 G. A. Miller, ‘Sensitivity to Changes in the Intensity of White Noise 
and its Relation to Masking and Loudness’, J. Acoust. Soc. Am. 19, 1947, 
609—619. 
the redundancy extraction process is carried further with the
purpose of describing the physical basis of language as a
communication system, which is the third level of specification.

The operational efficiency³⁷ of a distinctive feature approach in
speech analysis is due to the recurrence of more or less the same
distinction in several minimal pairs. This is the easiest method of
making speech analysis manageable, especially when speech
production, the speech wave, and speech perception are to be
interrelated.

One of the main arguments in favour of any phonemic approach
is the greater consistency with which subjects respond when put to
the task of transcribing real or artificial speech. When the
specificational frame is extended to include narrow phonetic
transcriptions, the listening test is more difficult to carry out. Phonetic
judgments are always influenced by the listener’s language, see further
Eli Fischer-Jørgensen.¹

A minimum redundancy specification of acoustic correlates
to distinctive sound features should not be confused with a general
description of speech on the acoustic level. The phonemic ordering
can be done on the basis of the phonetic facts, but not vice versa.
If a statement of the acoustic correlates to a distinction is to be
something more than empty words for the language student, he
must be well acquainted with the general appearance of
spectrographic pictures. The tentative and very condensed presentation
of the distinctive features offered by Jakobson et al.³⁸,³⁹ is not
intended as an introduction to modern spectrographic techniques or
as an introduction to the art of reading visible speech.

The process of learning the essentials of the visible speech
patterns as displayed by a spectrograph is facilitated by a constant
correlation of known articulatory data to the observed acoustic
data. Besides the theoretically complete predictability of the speech
wave from speech production, there exists a fair degree of
predictability in the reverse direction permitting the investigator to make
physiological interpretations of spectrograms.⁴⁰ Compensatory

37 M. Halle, ‘The Strategy of Phonemics’, Word 10, 1954, 197–209.

38 R. Jakobson, C. G. M. Fant, M. Halle, ‘Preliminaries to Speech
Analysis’, Acoustics Laboratory, M. I. T., Techn. Report No. 13, 1952.

39 R. Jakobson, M. Halle, Fundamentals of Language, ’S-Gravenhage
1956.


40 P. Delattre, ‘The Physiological Interpretation of Sound Spectro- 
grams’, PMLA LXVI, 1951, 864—875. 




forms of articulation are not a serious objection since a
compensation is never complete. Rather accurate estimates can be made
of what articulatory movements have occurred, given the full
evidence of the spectrographic record. This is a promise rather than
the actual state of our knowledge. Several essentials of the causal
relations between articulation and speech wave have been
revealed,⁴¹⁻⁴⁴ partially with the aid of electrical speech synthesizers that
are configurative analogs of the human vocal tract,⁴⁵⁻⁴⁷ but there
remains much to be done in this interesting area of acoustic
research. The complication is that a single articulatory variable
affects several variables in the spectrogram. Conversely, any specific
variable in the spectrographic picture is generally related to several
of the articulatory variables. Similar complications exist when
relating acoustic data to perception or to the units of the speech
message.

According to Joos,⁴ acoustic phonetics was in its infancy in 1948.
We are actually still in an early period of development in spite of
our new instruments for speech analysis and synthesis and our
better understanding of the fundamentals of speech communication.
We do not yet know the significance of all details within
spectrographic pictures. Even our basis for stating what can be heard and
what cannot be heard is rather unsatisfactory since our knowledge
is mostly restricted to simple stimuli such as white noise and sine
waves. One should not uncritically apply these data to the theory
of speech perception. Some basic work has been performed on the
auditory discrimination of small pattern differences in speech-like
stimuli,³⁴⁻³⁶ but almost no data have been collected for connected
speech. The tolerances are probably larger in connected speech.

41 See reference 20.

42 H. K. Dunn, ‘The Calculation of Vowel Resonances and an Electrical
Vocal Tract’, J. Acoust. Soc. Am. 22, 1950, 740–753.

43 C. G. M. Fant, ‘Transmission Properties of the Vocal Tract with
Application to the Acoustic Specification of Phonemes’, Acoustics Laboratory,
M. I. T., Techn. Report No. 12, 1952.

44 Jw. van den Berg, Physica van de stemvorming met toepassingen,
’S-Gravenhage 1953.

45 K. N. Stevens, S. Kasowski, C. G. M. Fant, ‘An Electrical Analog of
the Vocal Tract’, J. Acoust. Soc. Am. 25, 1953, 734–742.

46 K. N. Stevens, A. S. House, ‘Development of a Quantitative De- 
scription of Vowel Articulation’, J. Acoust. Soc. Am. 27, 1955, 484—493. 

47 K. N. Stevens, A. S. House, ‘Studies of Formant Transitions Using 
a Vocal Tract Analog’, J. Acoust. Soc. Am. 28, 1956, 578—585. 
Vowel qualities can be objectively measured in terms of formant
patterns, but we have not yet arrived at quite satisfactory methods
for presenting the data in a simple form retaining the information
from several formants and eliminating personal scale factors. We
still argue about the significance of auditory quality scales,⁴⁸ and
we have not yet made very serious attempts to establish
internationally accepted quality standards on an acoustic basis. A first
attempt has been made by the Haskins group.⁴⁹ However, such
standards will eventually come. Our present speech synthesizers
are capable of producing very natural vowels, and we should be
able to standardize their performance so that an objective quality
norm becomes available as a support for narrow phonetic
transcriptions.

An acoustic description on the basis of the speech wave will
probably not replace the articulatory descriptions of speech
production as the most important physical reference for the linguists’
phonetic considerations,¹ but it will serve as an increasingly
important supplement. Acoustic phonetics of to-day is rather
technical in character due to the fundamental developments made by
communication engineers. Several linguists, psychologists, physicists,
and others have, however, made substantial contributions to this
field.

Most of the larger research groups are at present attached to
technical institutions, but they generally comprise active members
from various disciplines which have a common interest in the
theory of speech. A requirement for successful cooperation is, as
pointed out by Eli Fischer-Jørgensen,¹ that they also have a
common knowledge of the means and principles of analysis. The scope
of the present activities is to investigate the whole communication
chain to the extent that it can be observed and to see how it works
by analysis of the signal structure at various stages within the
chain and by analytical and statistical studies of the codes for
translating data from one stage of specification to any other. This is
the ambition, but we have only seen the start of this extended
research.

48 P. Ladefoged, ‘The Classification of Vowels’, Lingua 5, 1956, 113.
49 P. Delattre, A. M. Liberman, F. Cooper, ‘Voyelles synthétiques à
deux formantes et voyelles cardinales’, Le Maître Phonétique 96, 1951, 30–36.

We cannot carry out very successful comparative language studies
on an acoustic basis before we have made a substantial
advance in general phonetics. Phonetic specialists are still too busy
investigating the capabilities of their new instruments and the 
significance of acoustic data to devote their time to large field studies 
on an acoustic basis. It is easier to get started with simpler instru- 
ments or limited scopes of the analysis, e.g. to measure acoustic 
correlates to prosodic features. New instruments are especially 
welcome in this field. Several extensive dialectal studies of
intonation have been carried out with the classical kymograph,⁵⁰⁻⁵⁵
for instance those of E. A. Meyer⁵¹ and E. Selmer,⁵⁰ who have
devoted years of hard labour and personal enthusiasm to its use. The
kymograph is ready for retirement as a test for intonation studies 
but might still find some use for physiological recordings®® and 
laboratory demonstrations. 

An extensive collection of dialectal speech material by means
of tape and gramophone recordings is being undertaken in many countries,
for instance,⁵⁶,⁵⁷ and some acoustic processing of the data has been
attempted.⁵⁷⁻⁵⁹ Besides the central problems of what to measure
and how to interpret the data, there is also the problem of how to 
measure large quantities. It can be objected that in many instances 
the limiting time factor is not the instrument but the time it takes 
to study and systematize the measured data. This is especially 
true of data that shall be prepared for publication in graphical form. 
It is, however, good practice to start with very extensive studies 

50 E. W. Selmer, ‘Die methodische Verwertung der Tonhöhenkurven’,
Opuscula Phonetica VIII, 1930.

51 E. A. Meyer, Die Intonation im Schwedischen I (1937); II (1954).

52 G. Panconcelli-Calzia, Die experimentelle Phonetik in ihrer Anwendung
auf die Sprachwissenschaft, Berlin 1924.

53 E. W. Selmer, ‘La Phonétique Expérimentale’, La Voix 1953, 43–55.

54 P. Menzerath, A. de Lacerda, Koartikulation, Steuerung und
Lautabgrenzung, Berlin–Bonn 1933.

55 R. H. Stetson, Motor Phonetics, Amsterdam 1951.

56 Svenska landsmålsarkivet. See F. Hedblom, ‘Recording in Dialect
Investigation in Sweden’, to be publ. in Phonetica 1958.

57 E. Zwirner, ‘Lautdenkmal der Deutschen Sprache’, Zeitschrift für
Phonetik 9, 1956, 3–13.

58 E. Zwirner, A. Maack, W. Bethge, ‘Vergleichende Untersuchungen
über konstitutive Faktoren deutscher Mundarten’, Zeitschrift für Phonetik
9, 1956, 14–30. References are given here to earlier works of the members
of the Zwirner school.

59 G. E. Peterson, ‘Phonetics, Phonemics, and Pronunciation: Spec- 
trographic Analysis’, Georgetown University Monograph Series on Language 
and Linguistics, Monograph No. 6, 1954. 
of all observable phases of the data for a few representative cases
and then decide on which measurements to take when performing
statistical studies of the major part of the material. High capacity
analysing techniques then become valuable. If the cost of collecting
data is low, it will be possible to extend the investigation and get
a statistically more valid basis. One can also afford to let the
subject talk under less strain. This can, however, always be achieved
when the recording and the analysis are kept apart.

The Sonagraph is valuable in this respect, but not ideal. Several 
recent developments at the Royal Institute of Technology in Stock- 
holm, described in section 3, are aimed at the processing of very 
large quantities of speech. These techniques include simultaneous 
intensity and automatic pitch display on a direct recording oscillo- 
graph and the use of a 48-channel spectrograph which provides an 
instantaneous spectral analysis in the form of a running record 
on 35 mm film. These latest instruments have an analysing capac- 
ity of the order of a hundred times that of the Sonagraph. One 
interesting field of study available by modern spectrographic 
techniques is concerned with the great differences between care- 
fully enunciated speech and the common everyday speech with its 
faster tempo and frequent omissions and assimilations. The spec- 
trographic record cannot be used as a substitute for an aural 
transcription, but it is a very valuable supplement for any language 
student who wants to check the physical evidence from the produc- 
tion of a particular utterance. 


2. Collection and Interpretation of Spectrographic Data 


2.1 The Sonagraph. Voice Periodicity. The Formants of Voiced Sounds

The Visible Speech sound spectrograph developed at the Bell 
Telephone Laboratories and now commercially available under the 
name of Sonagraph has been described in detail elsewhere.² Its
application for phonetic research has been treated by G. E. Peterson 
in several articles, e.g. refs. 6 and 59; see also ref. 1.

Speech utterances of 2.4 seconds length can be handled with the
Sonagraph. The analysis procedure takes 5 minutes,
during which the spectral picture is traced on Teledeltos paper
attached to a cylindrical drum. A spectrogram obtained by this
instrument has time as its horizontal dimension and frequency as
its vertical dimension. The distribution of spectral energy within this
time-frequency space is represented by a pattern of variable density
black marking. This intensity representation is only qualitative
and does not cover a very large range. Weak sounds are often
portrayed rather incompletely. The visible pattern of a sound is
thus essentially determined by its more intense spectral
components. A single energy maximum is called a formant. The formants
of voiced sounds have a fine structure, the shape of which depends
on the frequency width, i.e. the bandwidth, of the analysing filter.
Two alternative bandwidths can be utilized. One is 300 c/s and
the other 45 c/s. When the wider filter is used, and provided the
fundamental pitch of the speaker is less than the filter bandwidth,
which is the case for male voices, the fine structure of voiced sounds
will show up as very narrow vertical lines intercepting the picture,
see spectrograms B and D of Fig. 1. Each of these indicates the
beginning of a voice fundamental period. Since the broad band
spectrogram displays the spectrum period after period, it is
apparent that it provides a basis for measuring the fundamental pitch
F₀ of the voice, i.e. the number of complete periods per second.
Denoting the duration of a complete period by T₀,


F₀ = 1/T₀    (1)


If, on the other hand, the narrow filter is utilized, the periodicity
shows up in a quite different way. Now the fine structure within
a formant appears as a few harmonics ordered in the horizontal
direction, provided the pitch is constant. The intonation can thus be
studied by tracing the location of any harmonic throughout the
picture. The fundamental pitch F₀ is the distance in frequency
between any two adjacent harmonic lines or the frequency position
of any harmonic divided by its order number, see spectrogram A
of Fig. 1. The harmonics can be separated more clearly if the
frequency scale has been expanded in the analysis, as in Fig. 14.
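
The three ways of reading the fundamental pitch from a spectrogram described above can be written out as a small sketch. The function names and the idea of passing hand-measured values in c/s and seconds are illustrative assumptions; nothing here reproduces measurements from Fig. 1.

```python
def f0_from_period(period_s):
    """Equation (1): fundamental pitch in c/s from the duration of one voice period in seconds."""
    if period_s <= 0:
        raise ValueError("period must be positive")
    return 1.0 / period_s


def f0_from_adjacent_harmonics(lower_cps, upper_cps):
    """Fundamental pitch as the frequency distance between two adjacent harmonic lines."""
    if upper_cps <= lower_cps:
        raise ValueError("upper harmonic must lie above the lower one")
    return upper_cps - lower_cps


def f0_from_harmonic(harmonic_cps, order):
    """Fundamental pitch as the frequency of one harmonic divided by its order number."""
    if order < 1:
        raise ValueError("harmonic order must be 1 or greater")
    return harmonic_cps / order
```

Dividing a high harmonic by its order number spreads a given reading error over many periods, which may be why tracing a higher harmonic is convenient for intonation work.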

It should be observed that the harmonic structure is a sole 
property of the larynx source and that the frequency location of 
the centre of a formant, in short its frequency, is related to the 
shape of the vocal cavities only. To a large extent formant frequen- 
cies and harmonics are mutually independent, and the frequency of 
a formant may fall anywhere between two harmonics. A section, 
see C and E of Fig. 1, is needed for the quantitative study of the 
amplitude of each harmonic. A Sonagraph is capable of produc- 
ing a maximum of 6 sections at a time within the 2.4 seconds long 


Fig. 1.- Spectrograms and sections obtained with a Sonagraph analyser. 
Speech material [didedadodu]. ABC American subject, DE Swedish subject. 
The spectrograms B and D are broad band spectrograms; narrow band filter 
was used for A and for sections C + E. The time locations of the sections 
are indicated by arrows under the spectrograms. Observe the greater F, 

F, separation in the American vowels and the low intensity and low
frequency of the second formant of the Swedish [i]. The reinforced 2nd harmonic
in the spectra of both speakers should not be mistaken for F1.


speech sample stored in the initial recording. Samples cannot be
taken closer together than 1/25 second. The section constitutes a sample
of the utterance from a specific time interval specified by its
location and time length. The latter is of the order of the reciprocal
value of the filter bandwidth, i.e. 1/45 second when the narrow
filter is used. This integrating effect or energy storage is due to
the inertia of the filter. A very narrow filter reacts very slowly
to changes in the signal.
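
As a rough numerical illustration of the reciprocal relation between filter bandwidth and time length, the sketch below evaluates it for the two Sonagraph bandwidths of 45 and 300 c/s. The function names are hypothetical, and the second function merely restates the condition given earlier for voice periods to appear as vertical lines.

```python
def section_time_length(bandwidth_cps):
    """Approximate time length of a section in seconds: the reciprocal of the filter bandwidth."""
    if bandwidth_cps <= 0:
        raise ValueError("bandwidth must be positive")
    return 1.0 / bandwidth_cps


def shows_voice_periods(f0_cps, bandwidth_cps):
    """True when the fundamental pitch is below the filter bandwidth,
    the condition stated in the text for single voice periods to show up
    as vertical lines; otherwise separate harmonics are resolved."""
    return f0_cps < bandwidth_cps


narrow_ms = section_time_length(45.0) * 1000.0   # about 22 milliseconds
broad_ms = section_time_length(300.0) * 1000.0   # about 3.3 milliseconds
```

With a male fundamental pitch of 120 c/s, the 300 c/s filter satisfies the condition while the 45 c/s filter does not, in agreement with the broad band and narrow band pictures described above.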

The narrow band analysing filter should be used for taking 
sections of voiced sounds, and the broad band filter should be used 
for analysis of unvoiced sounds. Since most Sonagraphs lack acces- 
sories for continuous adjustment of the place in time from which 
a section is taken it is not recommended to attempt taking sections 
of stops. Even if fine adjustment can be made it will be hard to 
estimate the time limitations of a section. 




A formant without reference to its dimensions is denoted by Fn =
F1, F2, F3, etc., where n stands for the number of the formant,
counting the formant of lowest frequency as n = 1. The frequency of the
formant, Fₙ = F₁, F₂, F₃, etc., can be measured either from the
centre of the visual formant band in a broad band spectrogram or
by the location of the peak of an envelope curve drawn to enclose
the peaks of the harmonics in a narrow band section. All intensity
values are generally expressed as relative values in a logarithmic
scale with decibel, abbreviated dB, as the unit. To indicate the use
of a logarithmic unit the term level is utilized. Formant levels are
thus denoted Lₙ. The formant level is identical with the peak level
of the envelope enclosing the formant. The bandwidth Bₙ is
determined as the distance in frequency between two points on the
spectrum envelope, one on each side of the formant peak, where
the envelope level is 3 dB below the level of the peak.
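
The definitions of formant frequency, formant level, and bandwidth can be sketched as a short routine working on a sampled envelope. This is only an illustration under stated assumptions: the envelope is taken to contain a single formant, the peak is taken at the highest sample rather than interpolated, and the 3 dB points are found by linear interpolation between samples. The function name and the sample values are hypothetical.

```python
def formant_peak_and_bandwidth(freqs_cps, levels_db):
    """Return (formant frequency, formant level, 3 dB bandwidth) for a single-peaked envelope.

    freqs_cps must be increasing; levels_db are the envelope levels in dB.
    The bandwidth is None if the envelope does not fall 3 dB below the peak on both sides.
    """
    if len(freqs_cps) != len(levels_db) or len(freqs_cps) < 3:
        raise ValueError("need at least three matching frequency/level samples")
    k = max(range(len(levels_db)), key=lambda i: levels_db[i])
    peak_freq, peak_level = freqs_cps[k], levels_db[k]
    target = peak_level - 3.0

    def crossing(indices):
        # Walk away from the peak until the level has dropped to the target,
        # then interpolate linearly between the two bracketing samples.
        prev = k
        for i in indices:
            if levels_db[i] <= target:
                f_a, l_a = freqs_cps[prev], levels_db[prev]
                f_b, l_b = freqs_cps[i], levels_db[i]
                return f_a + (target - l_a) * (f_b - f_a) / (l_b - l_a)
            prev = i
        return None

    lower = crossing(range(k - 1, -1, -1))
    upper = crossing(range(k + 1, len(freqs_cps)))
    bandwidth = None if lower is None or upper is None else upper - lower
    return peak_freq, peak_level, bandwidth


# Hypothetical envelope samples around a first formant near 500 c/s.
example = formant_peak_and_bandwidth([400, 450, 500, 550, 600], [20, 26, 30, 26, 20])
```

In practice the envelope would be drawn through the harmonic peaks of a narrow band section, so the accuracy of the result depends on how densely the harmonics sample the formant.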

The Sonagraph thus provides data on the fundamental pitch 
and frequency composition of speech. By means of a special accessory, 
the amplitude display unit, the total intensity of the speech wave 
can be displayed synchronously with the spectrogram, see Fig. 
13-15. The amplitude display unit produces a logarithmic intensity 
measure, and the intensity level L in dB is thus proportional to 
the height of the amplitude curve. A point on this curve represents 
an integrated average over a time of the order of 10 milliseconds. 
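
A digital counterpart of such a logarithmic intensity curve might look as follows. This is a sketch only: the circuit of the amplitude display unit is not described here, so the non-overlapping 10 millisecond averaging frames, the mean-square measure, and the reference value are all assumptions made for illustration.

```python
import math


def intensity_levels_db(samples, sample_rate, frame_s=0.010, reference=1.0):
    """Intensity level in dB for successive frames of a sampled speech wave.

    Each value is 10*log10 of the mean square amplitude within one frame,
    relative to reference**2. Frames do not overlap (an assumption).
    """
    frame_len = max(1, int(round(sample_rate * frame_s)))
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        if mean_square == 0:
            levels.append(float("-inf"))  # silence has no finite level
        else:
            levels.append(10.0 * math.log10(mean_square / reference ** 2))
    return levels


# Two 10 ms frames at 10000 samples per second: full amplitude, then one tenth of it.
demo = intensity_levels_db([1.0] * 100 + [0.1] * 100, 10000)
```

A tenfold drop in amplitude gives a drop of 20 dB, so equal steps on such a curve correspond to equal amplitude ratios.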

The following range of formant frequencies will be found for 
non-nasal voiced sounds produced by an average male voice. 


F₁ = 150-850 c/s
F₂ = 500-2500 c/s
F₃ = 1500-3500 c/s
F₄ = 2500-4500 c/s


Formant bandwidths range from 40-250 c/s. The average value 
of both B, and B, is of the order of 75 c/s. 

Females have on the average 17 % higher formant frequencies
than men. This statistical difference is physiologically conditioned
by the total length of the vocal cavities from the glottis to the lips,
which is smaller by this amount, comparing female to male data.
The average distance between formants is c/2l, where c = 35300
cm/s is the velocity of sound and l is the total length of the vocal
cavities. Since l is of the order of 17.5 cm, c/2l comes close to
1000 c/s. The width of a speaker’s vocal tract does not have the
same influence on these relations. Children have even higher
formant frequencies. The male-female differences show typical 
variations with the particular sound and formant under observation 
as shown earlier.*° The fundamental pitch Fy is nearly 1 octave 
higher in female than in male voices. Typical mean values are F, 
= 220 c/s and Fy = 120 c/s respectively; children of the age of 
10 have an average Fy of 300 c/s. The normal extent of tone vari- 
ation, i.e. of Fy, in connected speech is of the order of 1 octave. 
Formant bandwidths are not very critical for the identity of vowels, 
and they are fairly closely correlated with the particular pattern 
of formant frequencies, i.e. basically with articulation. Formant 
levels are completely predictable from a specification of the speaker’s 
voice source and the data on the formant frequencies and band- 
widths. The vowel formants are thus the major physical determi- 
nant of vowel quality. The first two formants are the most impor- 
tant, except in the case of front vowels where F, must be included 
in the specification (see further Eli Fischer-Jargensen).? 
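
As a rough numerical check of the spacing rule above, the following Python sketch evaluates c/2l for the values quoted in the text. The function name is hypothetical, and reading "smaller by this amount" as a female tract length of l/1.17 is an assumption made only for illustration.

```python
# Values quoted in the text: velocity of sound in cm/s, male tract length in cm.
C_SOUND_CM_PER_S = 35300.0
MALE_TRACT_CM = 17.5


def formant_spacing(length_cm, c=C_SOUND_CM_PER_S):
    """Average distance between adjacent formants, c / (2 l), in c/s."""
    return c / (2.0 * length_cm)


male_spacing = formant_spacing(MALE_TRACT_CM)

# Assumption: a tract shorter by the factor 1.17 raises all formants by 17 %.
female_spacing = formant_spacing(MALE_TRACT_CM / 1.17)

print(round(male_spacing, 1), round(female_spacing, 1))
```

The male value comes out slightly above 1000 c/s, in line with the statement that c/2l comes close to 1000 c/s.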


2.2 Two-dimensional Representation of Vowels 

The composite data of F1 versus F2 for a group of Swedish 
speakers comprising 7 male voices and 7 female voices is shown 
in Fig. 2. There is a considerable overlap between phoneme areas 
in this acoustic vowel diagram.⁶⁰ A male ä1 [ɛ] can come close to 
a female ö1 [ø] in the diagram. The overlap is primarily due to the 
physiologically conditioned scale factor discussed above and is 
partly due to the omission of formants higher than the second. 
The F1 versus F2 plot provides, however, sufficient evidence for 
stating the main relations between the phonemes of a single speaker. 
If two phonemes are opposed to each other in the front-back 
dimension, there will be found a smaller distance F2 - F1 in the 
more retracted member of the opposition. Two phonemes whose 
articulatory opposition is tongue height must differ in F1; thus 
the lower vowel has the higher F1. The effect of lip-rounding or 
lip-protrusion is always to lower the frequencies of all formants 
by smaller or greater amounts. Thus an articulatory opposition of 
rounded versus unrounded can be translated into the acoustical 
measure of F1 + F2 which is always lower for the more rounded 
member. F1 + F2 + F3 is of course an even more effective criterion. 
These relations will be dealt with in further detail at a later stage 
of this article. 


60 Compare corresponding data in ref. 5,7. 

Fig. 2. Acoustic representation of Swedish vowels in a diagram of 
F1 versus F2 based on measurements of 7 male speakers and 7 
female speakers. The great overlap between the areas occupied by 
closely related front vowels is primarily due to the mixture of 
male and female data but also to the inadequacy of representation 
by means of two formants only. The transcription is based on the 
vowel symbols of the Swedish alphabet with the addition of a 
subscript 1 for long vowels, 2 for short vowels, 3 for long pre-r 
allophones of ö and ä, and 4 for the corresponding short vowels. 
Thus o1 is pronounced [u:], å1 [o:], u1 [ʉ:] and u2 [ɵ]. Only those 
vowels that the subjects could keep apart from other vowels in 
sustained form were included. Analysis performed by the sweep 
frequency method at LME, Stockholm 1947. 


These rules pertaining to the features of gravity, compactness, and 
flatness respectively always hold for a single speaker in a given 
context of equal stress and length and with the same surrounding 
sounds. These and other distinctive features have been discussed at 
greater length elsewhere.** It should be observed that one and the 
same feature can be formulated slightly differently and still serve 
the same purpose. It is, for instance, generally sufficient to refer 
to the higher F2 as the acoustic correlate of a more fronted position. 
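
The three measures named above can be collected in a small Python sketch. The function name and all formant values are hypothetical and were chosen only to illustrate the direction of each relation for a single speaker; they are not measurements from the article.

```python
def vowel_measures(f1, f2, f3):
    """Return the frequency measures discussed in the text (all in c/s)."""
    return {
        "f2_minus_f1": f2 - f1,   # smaller in the more retracted (grave) member
        "f1": f1,                 # higher in the lower (more compact) vowel
        "f1_plus_f2": f1 + f2,    # lower in the more rounded (flat) member
        "f1_plus_f2_plus_f3": f1 + f2 + f3,
    }


# Hypothetical formant frequencies for one speaker (illustration only).
unrounded_front = vowel_measures(300, 2100, 3000)
rounded_front = vowel_measures(290, 1900, 2400)
back = vowel_measures(320, 800, 2400)

print(back["f2_minus_f1"] < unrounded_front["f2_minus_f1"])
print(rounded_front["f1_plus_f2"] < unrounded_front["f1_plus_f2"])
```

Both comparisons hold for these invented values, which is all the sketch is meant to show; the rules themselves are stated for one speaker in a fixed context.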

The phonetic value of front vowels is not very well represented by 
the F1 versus F2 plot. F3 is considerably higher in [i] than in [y] 
and [e]. There is also a small contribution from F4 to be taken 
into account. A large increase in F0 has the effect of ‘flattening’ the 
phonetic quality.*?2 We are in need of a formula in which to insert 
the primary data from the measurements of the frequencies of 
the first three formants and the fundamental pitch. The formula 
should give us two new variables which serve the same function 
as F1 and F2. Such a graph will, however, still be an approximation 
since the perception of vowels is not strictly two-dimensional. 
Three-dimensional graphs® are of course more exact, but it is 
questionable whether the gain stands in proportion to the 
representational cost. 

It has been shown from experiments with synthetic speech at 
Haskins Laboratories⁶¹ and in Sweden that vowels of the type 
[u o ɔ ɑ a], where F2 comes close to F1, can be simulated by a single 
formant representing the average of F1 and F2 in [ɔ ɑ a] and of a 
position closer to F1 for [u o]. A better quality is obtained if two 
formants are utilized, one for F1 and one for F2. In front vowels 
two formants are needed, and the higher of these should be placed 
closer to F3 to simulate [i]. The parameter 


F2' = F2 + ½ (F3 - F2) (F2 - F1) / (F3 - F1)          (2) 


could be used as an approximate measure of the effective pitch of 
the higher formant group. It can be seen that when F3 comes very 
close to F2 the parameter F2' constitutes a frequency location 
halfway between F2 and F3 and that F2' is very close to F2 if 
F2 - F1 is very small. It is also more representative for hearing 
to convert the frequency positions of the formants into mels. 
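
The two limiting cases described for equation (2) can be checked numerically. The sketch below assumes that the equation reads F2' = F2 + ½ (F3 - F2)(F2 - F1)/(F3 - F1), which is the reading consistent with both limits stated in the text; the function name and the test frequencies are hypothetical.

```python
def effective_f2(f1, f2, f3):
    """Effective second-formant position F2' after the assumed form of eq. (2)."""
    return f2 + 0.5 * (f3 - f2) * (f2 - f1) / (f3 - f1)


# Case 1: F3 very close to F2 -> F2' lies about halfway between F2 and F3.
near_f3 = effective_f2(300.0, 2000.0, 2010.0)

# Case 2: F2 - F1 very small -> F2' stays very close to F2.
near_f1 = effective_f2(600.0, 610.0, 2500.0)

print(round(near_f3, 1), round(near_f1, 1))
```

With these invented inputs the first result falls within a fraction of a cycle of the midpoint 2005 c/s, and the second stays within 5 c/s of F2.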


Table 1. 
Conversion of frequency in cycles per second to pitch in mels 
Data from Beranek⁶² 


Frequency    Pitch 
c/s          mels 
14000        3250 
9000         3000 
6600         2750 
5100         2500 
4000         2250 
3120         2000 
2450         1750 
1900         1500 
1420         1250 
1000         1000 
670          750 
             500 
160          250 
20           0 


61 P. Delattre, A. M. Liberman, F. S. Cooper, L. J. Gerstman, ‘An 
Experimental Study of the Acoustic Determinants of Vowel Color’, Word 8, 
1952, 195-210. 

62 L. L. Beranek, Acoustic Measurements, New York 1949. 


Thus M1 is substituted for F1, M2 for F2, and so on. This translation 
is not necessary but has some advantages since perceptually 
equal steps in phonetic quality seem to be more closely related to equal 
steps on the mel scale than to constant frequency intervals. 
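
A conversion from c/s to mels can be sketched by interpolating between the entries of Table 1. Linear interpolation between table rows is an assumption made here for simplicity, not a procedure given in the source, and the row for 500 mels is left out because its frequency is not legible in the table. The names `MEL_TABLE` and `hz_to_mel` are hypothetical.

```python
from bisect import bisect_left

# (frequency in c/s, pitch in mels) pairs copied from Table 1, ascending.
MEL_TABLE = [
    (20, 0), (160, 250), (670, 750), (1000, 1000), (1420, 1250),
    (1900, 1500), (2450, 1750), (3120, 2000), (4000, 2250),
    (5100, 2500), (6600, 2750), (9000, 3000), (14000, 3250),
]


def hz_to_mel(freq):
    """Convert a frequency in c/s to mels by linear interpolation in MEL_TABLE."""
    freqs = [f for f, _ in MEL_TABLE]
    if not freqs[0] <= freq <= freqs[-1]:
        raise ValueError("frequency outside the range of Table 1")
    i = bisect_left(freqs, freq)
    if freqs[i] == freq:
        return float(MEL_TABLE[i][1])
    (f_lo, m_lo), (f_hi, m_hi) = MEL_TABLE[i - 1], MEL_TABLE[i]
    return m_lo + (m_hi - m_lo) * (freq - f_lo) / (f_hi - f_lo)


print(hz_to_mel(1000), hz_to_mel(1210))
```

Because of the omitted row, values between 160 and 670 c/s are interpolated over a wider interval than elsewhere and are correspondingly less reliable.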

The tendency of male-female overlapping of data for front 
vowels can be avoided by normalizing all formant positions by 
means of a scale factor k3, which is the average position M3 of 
the speaker’s third formant divided by the average M3 for all 
speakers. The parameters M1/k3 and M2'/k3 have thus been utilized 
for the vowel diagram of Fig. 3 in which each of the analysed vowels 
is represented by 4 points, one for the average male data, one for 
the average female data, and in addition one male speaker and one 
female speaker. The general relations between the vowels are 
essentially the same as in an F1 versus F2 diagram, but the overlapping 
is reduced and there is a clearer separation between the separate 
front vowels. 
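
The normalization step amounts to one division per formant. In the following Python sketch the function names and all mel values are hypothetical; the reference average is passed in explicitly, since the choice of reference group is a matter for the analyst.

```python
def scale_factor(speaker_m3_mean, reference_m3_mean):
    """k3: the speaker's average M3 divided by the reference average M3."""
    return speaker_m3_mean / reference_m3_mean


def normalize(mel_position, k3):
    """Normalized formant position, e.g. M1/k3 or M2'/k3."""
    return mel_position / k3


# Hypothetical averages in mels (illustration only).
k3 = scale_factor(1962.0, 1800.0)
m1_normalized = normalize(1090.0, k3)

print(round(k3, 2), round(m1_normalized, 1))
```

A speaker whose third formant lies 9 % above the reference thus has every mel position scaled down by the same 9 %.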

Adopting the letter symbols of Swedish orthography as phoneme 
symbols the observed relations can be expressed in the following 
simplified form. 


Long vowels Short vowels 
1) ay Ig ag 
Us 
ay Og ag 


This system pertains to the maximum number of possible 
distinctions. The /e2/-/ä2/ distinction is often lost. The length 
distinction is combined with quality differences, essentially of the 
tense-lax type discussed in ref.*8. It should be noted that /u1/ 
pronounced [ʉ] and /u2/ pronounced [ɵ] are very different. The former 
is clearly a front vowel and the latter is a centralized back vowel in 
Stockholm pronunciation. 


[Fig. 3 diagram: both axes in mels. Legend: male group k3 = 1.00, 
female group k3 = 1.09, subject k3 = 1.06.] 

Fig. 3. Swedish vowels, see further text of Fig. 2, presented in a diagram 
based on the normalized position of the first formant M1/k3 and the 
normalized position of the effective second formant M2'/k3 on the mel 
scale. The parameter M2' comes close to M2 in back vowels and close to 
½ (M2 + M3) in front vowels, see further the text. Average values for the 
group of male subjects and for the female subjects as well as data for two 
speakers are plotted in the figure. The correction factor k3 pertains to the 
ratio of the average position of the subjects’ third formant on the mel scale 
to the average for the group of male speakers. This correction has eliminated 
a great part of the male/female differences, especially in front vowels. Front 
vowels have consistently higher M2' - M1 than back vowels, and rounded 
(flat) front vowels have consistently lower M2' + M1 than the unrounded 
(plain) front vowels. 


The most economical plan for instructing a phoneme detector 
to identify the various phonemes is as follows: First separate the 
phonemes /o å a/, both short and long, from the rest of the system 
by the criterion of F2' - F1 or M2' - M1 being smaller than a 
critical value. These are the grave vowels. Then the unrounded 
front vowels, i.e. the plain acute phonemes, /i e ä/ are separated 
from the rest of the system by means of the criterion that F2' + 
F1 or M2' + M1 shall exceed a certain critical value. Next, as 
suggested by Malmberg,⁶³ the phoneme /u/ can be separated from 
/y/ and /ö/ as being even more flat, i.e. by means of an even smaller 
F2' + F1 or M2' + M1. In each of the three remaining groups, 
/o å a/, /y ö/, and /i e ä/, the compactness criterion is applied, i.e. 
a subdivision is performed on the basis of the frequency of the first 
formant, M1 or F1, which increases with the larger articulatory 
opening. Finally, the length of the phoneme relative to the length of 
the following consonant, with due regard to the speaking tempo and 
also to the remaining quality differences, is evaluated for the 
tense-lax distinction.** The order of these operations is not crucial 
except that the two last operations should be made simultaneously 
and sometimes preferably in reversed order. This analysis pertains 
to stressed syllables only. 
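
The first three steps of this plan can be written as a short decision procedure. The Python sketch below is only an illustration: the function name, the group labels and, above all, the three critical values are hypothetical placeholders, since the text does not state numerical thresholds. The compactness and tense-lax steps are omitted.

```python
def vowel_group(m1, m2_eff, grave_max=600.0, plain_min=2100.0, flat_max=1700.0):
    """Assign a vowel to one of the four groups named in the text.

    m1 and m2_eff stand for M1 and M2' in mels. The three critical values
    are hypothetical placeholders, not figures from the source.
    """
    if m2_eff - m1 < grave_max:
        return "grave"          # /o å a/
    total = m2_eff + m1
    if total > plain_min:
        return "plain acute"    # /i e ä/
    if total < flat_max:
        return "extra flat"     # /u/, separated as suggested by Malmberg
    return "flat acute"         # /y ö/


# Hypothetical mel positions (illustration only).
print(vowel_group(600.0, 1000.0))
print(vowel_group(300.0, 1900.0))
print(vowel_group(300.0, 1600.0))
print(vowel_group(350.0, 1300.0))
```

Within each resulting group a further split on M1 would implement the compactness criterion; usable thresholds would have to be fitted to measured data for the speaker or group in question.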

As an alternative to the third degree of flatness, proposed by 
Malmberg,⁶³ /u/ can be regarded as being acute relative to /o/ 
but grave compared to /y/, i.e. plus minus acute. This latter 
solution is simplest when discussing dialectal loss of the /u/ - /o/ 
distinction and for opposing /u,/ to /a,/. The coding procedure 
has an equal cost in both cases. The acoustic flattening criterion 
as defined above also applies to the relations between /o å a/, as 
could be expected. 

There are a multitude of mathematical operations that can be 
performed on the primary material of formant frequency data for 
constructing vowel diagrams. The system adopted in Fig. 4 is not 
proposed to have any marked advantages, primarily because it is 
too complicated for common use. The two variables are the product 
of the normalized mel positions of the three first formants and the 
ratio of the geometrical mean position of the second and third 
formants to the position of the first formant. It can be seen 
that the relations between phonemes stay essentially the same as 
in other vowel diagrams. The female-male differences are much 
reduced and the voiced consonants /v j l/ are clearly outside the 
vowel frame. 
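
The two variables described for Fig. 4 can be computed as follows. This Python sketch takes the verbal definition at face value; the function name and the example mel values are hypothetical.

```python
import math


def fig4_variables(m1, m2, m3, k3):
    """Return (product, ratio) for normalized mel positions M1, M2, M3.

    product: product of the three normalized mel positions.
    ratio:   geometric mean of the second and third positions divided
             by the first position.
    """
    n1, n2, n3 = m1 / k3, m2 / k3, m3 / k3
    product = n1 * n2 * n3
    ratio = math.sqrt(n2 * n3) / n1
    return product, ratio


# Hypothetical mel positions (illustration only).
product, ratio = fig4_variables(400.0, 900.0, 1600.0, 1.0)
_, ratio_scaled = fig4_variables(400.0, 900.0, 1600.0, 1.09)

print(product, ratio, round(ratio_scaled, 6))
```

The ratio is unaffected by the scale factor, since k3 cancels; only the product depends on the normalization.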

63 B. Malmberg, ‘Distinctive Features of Swedish Vowels. Some 
Instrumental and Structural Data’, For Roman Jakobson, ’s-Gravenhage 
1956, 316-321. 


[Fig. 4 diagram: axes in mels. Legend: male voices = 1.00, reference 
voice = 1.02, female voices = 1.09.] 

Fig. 4. Swedish vowels and some voiced consonants, see further text of 
Fig. 2, arranged in a two-dimensional figure according to the product of 
the normalized mel values of the three first formants as one parameter and 
the ratio of the geometrical mean of the second and third formant mel 
positions to that of the first formant as the second parameter. The average 
data for male speakers and for female speakers and the reference male 
subject are plotted separately. The reason the consonant [r] comes inside 
the vowel diagram, and not outside like [v j l], is that the open phase 
of the trill and not the closure phase was measured. 


2.3 Cavity-Formant-Intensity Relations 

The present standpoint of cavity formant theories, largely 
originating from H. K. Dunn** and studies at M. I. T.** including the 
important contributions of Stevens and House,**-*? has been 
reviewed by Fischer-Jørgensen.¹ It remains here to summarize and add 
some results from recent investigations.⁶⁴-⁶⁵ 

(1) The classical method of describing tongue articulation by 
the position of the highest point of the tongue serves a pedagogical 
purpose only. An articulatory diagram of vowels based on the 
measurement of this reference point resembles the F1 F2 diagram. 
However, this is only a coincidence due to the close correlation 
between this point and the major movements of the mass of the 
tongue. 


64 C. G. M. Fant, Acoustic Theory of Speech Production, to be publ. by 
Mouton and Co., ’s-Gravenhage. 

65 C. G. M. Fant, ‘Den akustiska fonetikens grunder’, Royal Inst. of 
Technology, Div. of Telegraphy-Telephony, Report No. 7, 1957. 

(2) The highest point of the tongue has in itself no acoustic 
significance except in very high vowels where it comes close to the 
position of the major constriction separating a back cavity from a 
front cavity. The role of the pharynx has been underestimated. 
Back vowels of type [a ɔ] have a pharyngeal narrowing which is 
their acoustically relevant point of articulation. The vowel [a] is 
the articulatory extreme to [i]. Both have approximately the same 
cross-sectional area at the place of maximum narrowing. In terms 
of place of articulation [i] and [a] are opposed as palatal to 
pharyngeal. Even the maximally compact vowel [a] can be regarded 
as a back vowel from an articulatory point of view, in conformity 
with the possible acoustic grouping of [a ɑ ɔ o u] in terms of small 
F2 - F1. 

(3) In general, every part of the vocal tract contributes 
somewhat to the tuning of all formants. In all back vowels both 
front and back cavities influence substantially the frequency of 
both F1 and F2. The second formant is mainly dependent on the 
front cavity in high mid vowels only and is a Helmholtz resonance 
only when some lip-rounding is superimposed. In open and half open 
front vowels F2 and F3 are fairly equally dependent on all parts 
of the vocal tract. In front varieties of the vowel [i] the second 
formant is a half wave-length standing wave resonance in the 
pharynx, the frequency of which is inversely proportional to the 
pharynx length l_p, i.e. F2 = c/2l_p, and the third formant is a 
mouth resonance. The more advanced the tongue position the more 
definite is this affiliation. In a Swedish or Russian [i] F2 is 
practically independent of the mouth cavity. It should thus be observed 
that contrary to the classical theory, the mouth resonator can 
influence F1 more than F2 as in a vowel [u] produced with retracted 
tongue position and very narrow lip-opening, and it can be mainly 
responsible for F3 as in [i]. 

(4) The uncoupled mouth resonance can be interpreted more 
simply on the perception plane since it tends to coincide with 
the effective pitch referring to the centre of the F1 F2 group in 
vowels where these formants come close, that is, perceptually ‘one 
formant sounds’, and to the effective centre of the F2 F3 F4 group 
in vowels where F1 and F2 are far apart. The effective pitch here 
means the frequency of the upper formant in a synthetic two-formant 
sound that phonetically matches the vowel. 

(5) The flattening effect of lip-rounding is greatest on those 
formants that are most dependent on the mouth cavity. The rising 
transition from a labial to the first part of a following vowel is thus 
essentially contained in F1 of [u], F2 of [a], and F3 of [i], in short in 
the formant most closely associated with the effective timbre pitch. 

(6) Two formants can never coincide unless the vocal tract 
is completely blocked at some place inside the lips. Under these 
conditions very little sound will be transmitted from a vocal cord 
source, and the formants have significance mainly as resonance 
frequencies. There is, however, some small escape of sound through 
the cavity walls as in a voiced occlusion of a stop sound. 

(7) When the tongue is raised against the palate, starting from 
the neutral position characterized by no appreciable tongue 
narrowing, the formant pattern will change from that of a regular 
spacing with 1000 c/s intervals between adjacent formants to a 
pattern characterized by a low F1 and a high F2 closer to F3. There 
will also be an increase in F3 if the tongue has moved to a front 
palatal position. When the tongue is in a retracted position, the 
back of the tongue approaching the pharynx wall, there will be a 
rise in F1 and a lowering of F2 which tends to cause an F1 F2 
proximity. Not until the tongue comes very close to the pharynx 
wall will F1 be lowered again. 

At complete obstruction of the air passage at any point of the 
vocal tract, F1 approaches but does not quite reach zero frequency. 
The midpalatal place of articulation providing minimum F2 F3 
distance is also that of maximum F2. As this place is passed in a 
forward movement of the tongue the dependency of F2 and F3 on 
front and back cavities changes from an F2 front cavity to an F2 
back cavity affiliation.4?7 The dependency is equal at the point of 
maximum F2. Similarly F1 and F2 change cavity dependency at 
the pharyngeal region of max. F1, min. F2. The limiting formant 
pattern for a uvular obstruction is characterized by an F2 of 
intermediate position and a moderately high F3. A shift of the place of 
articulation in advance of the F2 F3 proximity point in the 
midpalatal region will cause a shift of F2 down from the maximum 
value 2400 c/s in a normal male voice combined with an increase 
in F3 until at the dental point of articulation F2 has been lowered 
to a position of the order of 1750 c/s with F3 remaining high at 
approximately 2900 c/s. A retroflex articulation will always be 
combined with a low position of F3. 

Delabialization alone can never cause a lowering of any of the 
four first formants, but will give rise to a greater or smaller increase 
in all of them. The influence of the degree of lip-rounding is always 
small on those formants that belong to cavities behind the tongue 
constriction as long as the tongue constriction area is smaller than 
the lip-opening. 

These rules should be kept in mind when studying the formant 
transitions from consonants to vowels and vice versa. 

It is not possible for a speaker to produce a change in the inten- 
sity level of a formant by means of an articulatory effort alone, 
without changing the general pattern of formant frequencies. An 
increase in voice effort will, however, tend to raise the level of for- 
mants in proportion to their frequency positions. The overall slope 
of the spectrum is very different comparing a vowel sampled from 
faint voiced speech and a vowel sampled from a loud shout. At high 
voice efforts the intensity level of the voice fundamental will be 
small compared to the intensity level of the formants, but the 
reverse relation is found in faint speech.?3-*4 In high-pitched female 
speech the voice fundamental also tends to be relatively dominating. 

The rules for changes in formant intensity levels conditioned 
by changes in the frequency location of one or more formants 
within the spectrum are not very difficult to learn.** A shift down in 
frequency of any formant, e.g. the first formant, will cause a 
drop in level of all parts of the spectrum above the formant and 
by the same amount at all places, at valleys as well as at peaks. 
The exact amount is 12 dB for each halving of the formant 
frequency. This rule explains why the third and higher formants 
of [u] are so weak that they sometimes are not observed on 
the sonagram. If both F1 and F2 are shifted down an octave, 
e.g. from F1 = 600 c/s, F2 = 1100 c/s in [a] to F1 = 300 c/s, 
F2 = 550 c/s in [u], the apparent result will be a decrease in 
intensity level of the higher part of the spectrum by 24 dB. This 
analytical predictability checks well with experimental results.⁶⁶ When 
two formants approach each other there will be a summation 
effect so that the intensity of both increases in inverse proportion 
to their distance on the frequency scale. Thus the high position of 
F2 in acute vowels shifts the balance over to the higher part of the 
spectrum. This is why [e] from a perceptual point of view is a 
two-formant sound and [o] a one-formant sound. 


66 C. G. M. Fant, ‘On the Predictability of Formant Levels and 
Spectrum Envelope from Formant Frequencies’, For Roman Jakobson, 
’s-Gravenhage 1956, 109-120. 
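
The 12 dB rule can be restated as a level change of 12·log2 of the frequency ratio. The Python sketch below reproduces the worked example in the text; the function name is hypothetical, and only the downward shift is covered by the rule as stated.

```python
import math


def level_change_db(f_old, f_new):
    """Change in level (dB) of the spectrum above a formant moved from
    f_old to f_new, at 12 dB per halving of the formant frequency.
    Negative values denote a drop in level."""
    return -12.0 * math.log2(f_old / f_new)


# Example from the text: [a] -> [u], both F1 and F2 lowered by one octave.
drop_f1 = level_change_db(600.0, 300.0)
drop_f2 = level_change_db(1100.0, 550.0)
total_drop = drop_f1 + drop_f2

print(drop_f1, drop_f2, total_drop)
```

The two contributions of 12 dB each add up to the 24 dB decrease quoted for the higher part of the spectrum.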

With increased articulatory narrowing in the mouth cavity 
causing a lower frequency F1, not only will the level of the spectrum 
above F1 be reduced, but also L1 will decay somewhat due to the 
increased damping. The extreme narrowing occurs in a voiced stop 
gap preceding the explosion of [b], [d], or [g]. This ‘voice bar’ is 
the first formant. The higher formants can, however, be detected 
in a section under favourable circumstances. There is a gradual 
scale of opening and thus intensity increase, comparing this 
pre-explosion sound with a voiced continuant, a close vowel, and an 
open vowel of the same tongue articulation. These relations throw 
light on the sonority scale since increased opening means higher 
F1 and greater overall intensity and loudness. They are also of 
some import for discussing the intensity variations within a voiced 
consonant syllable. The lower frequency position of F1 of a voiced 
consonant compared to F1 of an adjacent vowel determines the 
relatively lower overall intensity of the voiced part of the 
consonant. In laterals there is an additional anti-resonance that 
contributes to the weakening of L,. Nasal consonants are weakened due 
to similar effects and large formant damping. Increased stress on, 
for example, [r v j] is combined with a more complete articulatory 
closure resulting in a reduced intensity compared to a previous or 
a following vowel. Further, the noise elements within the consonant 
increase in intensity, especially if the narrowing has proceeded so 
far that the voice part is filtered out effectively. On the other hand 
a voiced stop in unstressed intervocalic position may be articulated 
with incomplete closure and thus incomplete intensity reduction 
of the closure phase so that F2 and F3 may show up on the 
spectrogram, see Fig. 15, the word flyger. 


2.4 Acoustic Segmentation and Description of Consonants 


Speech is apparently a combination of continuous and discrete 
characteristics. The movements of the articulators are reflected 
by continuous pattern changes in the spectrogram. In addition 
there are rapid changes of intensity and composition due to the 
switching off or on of different sound sources, voiced or unvoiced, 
and due to the movements of the tongue or the lips to and especially 
from a position of complete closure in the median pathway of the 
vocal tract. Articulatory phonetics has emphasized the continuous 
aspects as seen from physiological recordings®** and X-ray film.®’ 
It is one of the great advantages of the spectrograph that the 
discontinuities and thus the physical boundaries between speech 
sounds or parts of speech sounds can be studied. A broad band filter 
analysis is optimal in this respect, displaying the regular, periodic 
structure of voiced sounds and the random striations of unvoiced 
sounds. Stops, affricates, and continuants are easily distinguished, 
and the characteristics of nasal consonants, nasalized vowels, 
laterals, and r-sounds can also be learned with some training, see 
further reference.* 

On the basis of the observable acoustic boundaries in a spectro- 
gram the speech wave can be divided into a succession of natural 
sound units, referred to as sound segments or sound intervals. 
Some of these will be recognized as sub-units within larger intervals 
of the dimension speech sound. For the traditional method of de- 
fining the duration of speech sounds no overlapping is allowed for 
in order that the total length of the utterance shall be the sum of 
the parts. Accepting this as a primary procedure, it must be recog- 
nized that a phoneme is generally identified on the basis of its 
traditional sound intervals plus the modification caused in sound 
intervals of adjacent phonemes. These stationary and transitional 
cues are in general not independently commutable units from a 
perceptual point of view — at least not in stop sounds — since the 
auditory impression will be based on the combined stimulus and 
not on each separately. They can be lumped together in a descrip- 
tion of the inherent distinctive features** of a phoneme but should 
be measured and stated separately in speech analysis. 

A few examples may illustrate the segmentation problem. An 
unvoiced aspirated stop sound may be composed of a maximum 
of 4 segments: stop gap + explosion + frication + aspiration. 
In addition, a preceding and a following vowel may contribute 
to the listener’s identification. The distinction between explosion 
and frication is a matter of source, explosion being the sound 
produced by the shock excitation of the vocal cavities due to the 
pressure release and frication originating from turbulent sound 
produced by the following flow of air through a narrow passage. As 
long as this passage is very narrow, it is essentially this passage plus 
the anterior cavities that contribute to shape the sound spectrum. 


67 J. Lotz, ‘The Structure of Human Speech’, Transactions of the New 
York Academy of Sciences, Ser. II, 16, No. 7, 1954, 373-384. 

The formant structure is very different from that of vowels. 
No trace is generally seen of F1, and the regions of spectral maxima 
are much broader and wider spread than in vowels. The only arti- 
culatory condition for the occurrence of a concentration of energy 
resulting in a single major formant is that there must be a cavity 
of appreciable size in front of the source, e.g. as in palatals or velars. 
This formant will have continuity with the mouth resonance of 
the following vowel as previously defined. If there is no appre- 
ciable cavity in front of the source, the sound energy will be more 
evenly spread in the spectrum. The high frequency formant region 
of dentals is shaped by the narrow predorso-alveolar channel, and 
the more even spread of energy of the labials is due to the absence 
of any shaping cavity. 

The duration of the explosion phase is limited by the decay time 
of the vocal cavities participating in the vibration. This time is 
the inverse of the bandwidth of the major resonance excited by the 
explosion. It is generally shorter than 15 milliseconds. The explosion 
phase and the frication phase have similar spectral composition 
except that the frication phase generally has more high frequency 
emphasis. 
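
The relation between decay time and bandwidth is a simple reciprocal. The Python sketch below uses hypothetical function names and an invented bandwidth of 100 c/s; the 15 millisecond limit is the figure given in the text.

```python
def decay_time_ms(bandwidth_cps):
    """Decay time in milliseconds, taken as the inverse of the bandwidth in c/s."""
    return 1000.0 / bandwidth_cps


def min_bandwidth_cps(max_decay_ms):
    """Smallest bandwidth (c/s) whose decay time does not exceed max_decay_ms."""
    return 1000.0 / max_decay_ms


print(decay_time_ms(100.0))                 # hypothetical 100 c/s resonance
print(round(min_bandwidth_cps(15.0), 1))    # bandwidth implied by the 15 ms limit
```

On this reckoning an explosion phase shorter than 15 milliseconds corresponds to a major resonance wider than roughly 67 c/s.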

The aspiration segment in its extreme form is merely an unvoiced 
version of the following vowel.3? It has the same random noise 
fine structure as unvoiced sounds in general, but the appearance 
of the formants F2 F3 F4 indicates that the whole vocal tract 
participates in the shaping of the spectrum. F1 is generally too weak to 
be detected since it is damped out by the open glottis. The 
distinction between frication and aspiration is thus essentially a matter 
of resonator system, the greater opening of the aspiratory interval 
being the necessary prerequisite for the cavities behind the source 
to participate effectively. 

In the corresponding lenis or voiced stop the onset of voicing 
starts sooner after the explosion. Normally the aspiratory interval 
is lacking and the fricative interval is shortened. The movements 
of the articulators away from the state of closure, as reflected by 
the formant bendings in the spectrogram, are now confined to a 
voiced segment, i.e. the beginning of the following vowel.*? A 
considerable part of these transitions belongs to the aspiration phase 
of the corresponding unvoiced stop sound, and F1, signalling the 
degree of opening, has thus reached a higher frequency position at 
the onset of voicing after the aspiration. 

From an articulatory point of view the vowel starts where the 
fricative segment ends even if the onset of voicing is delayed due 
to aspiration. Other instances of assimilated voicelessness can be 
observed in spectrograms, for instance in the whole or the first 
part of [l r j v] following an ‘unvoiced consonant’ (e.g. Swedish 
tre ‘three’). 

The shift from a frication interval to an aspiration interval is 
often gradual and seldom complete so that a simultaneous mixture 
of the two is frequently found at the end of any unvoiced consonant, 
not only stops. 

Sonagraph spectrograms of Russian syllables [za z,a sa s,a 
ta t,a ca ča] are shown in Fig. 5 together with sections and intensity 
curves recorded on a Mingograph. Six separate Mingograph 
curves are assembled above each spectrogram. The top curve 
is a normal oscillogram limited in frequency to 0-800 c/s, 
which is the range covered by the Mingograph. 

The second curve from the top is the overall intensity versus 
time curve. A prefiltering, in the form of a bass reduction 
corresponding to the standardized A-curve of sound level meters, was 
performed ahead of the rectification and smoothing process in the 
intensity measuring device. See further section 3. The A-curve 
pre-emphasis corresponds to a 40-phon equal loudness contour and 
causes an attenuation of the lower frequencies. A very small degree 
of bass cut corresponding to the B-filter curve of sound level meters 
was adopted for curve 1. Curves 3, 4, 5, and 6 show the temporal 
variations of speech intensity within certain broader frequency 
bands. These are highpass 1500 c/s, bandpass 1400-1800 c/s, 
bandpass 2800-3600 c/s, and highpass 4000 c/s respectively. 
The consonant sections placed on each side of the spectrograms were 
assembled from bandpass intensity oscillograms utilizing a 150 c/s 
wide analysing filter.* 


* The technical processing of the data was started at the 
Massachusetts Institute of Technology in cooperation with M. Halle and 
G. Hughes. The data in Fig. 5 and Fig. 6 will be utilized to illustrate the 
relations between articulation and speech wave in a coming publication.** 
The Russian consonants have been analysed by Halle.⁶⁸ This subject will 
be treated in greater detail in a coming publication.® 

68 M. Halle, The Russian Consonants, Mouton and Co., ’s-Gravenhage. 


The time positions within the spectrograms from where the 
section samples originate have been indicated. Contrary to the 
low intensity resolution of the spectrogram the available intensity 
range of the sections is large enough to display F2 and F3 within 
the consonant. Observe the gradual onset of noise energy in the 
high frequency region within the [z] and [z,], [s] and [s,] segments 
and the simultaneous decrease and weakening of the ‘voice bar’, 
i.e. F1 of [z] and [z,]. These characteristics reflect increasing 
articulatory closure combined with increasing lung pressure in the 
onset of the syllable. The regularly spaced voicing striations are 
seen in the high frequency region. This periodic pattern indicates 
not a mixture of voice and noise but of noise being intensity 
modulated by the periodic air flow variations.⁶⁹ 

The affricates are of shorter duration and the intensity peak is 
shifted towards the middle of the fricative segment. Observe the 
centrally placed single formant of the compact affricate [č] and 
the high frequency formant region of [c] similar to [s]. 

The sharp-plain distinction between [z,] and [z], [s,] and [s], and 
[t,] and [t], i.e. the palatalization, shows up in the higher F2-position 
of the vowel following the sharp stop. There is also a marked 
tendency towards affrication of the sharp stops as seen from the 
prominent fricative sound segment of [t,a]. The explosion phase is of 
insignificantly low duration and intensity in all these stops and 
affricates. In [ta] only a short fricative phase can be measured, and in 
[t,a] all three phases have been measured. The frication intensity 
is highest and in the following phase, marked III, comprising mixed 
frication and aspiration, there are traces of F3, F4, F5, and F6. 

When statistical studies of consonant spectra are attempted, 
it becomes necessary to summarize the spectral data by a specifi- 
cation of intensity and frequency of the most important formants 
or formant regions found in each of the segments of interest. 

When performing detailed studies of this sort it is best to 
supplement tabulations with pictures of typical spectral sections. 
If average data shall have any significance it is of course necessary 
that the contextual frame be the same in all samples. There is, 
however, a basic difficulty of a descriptive nature in identifying 
and labelling for further reference the various formants to be 

69 W. Meyer-Eppler, ‘Untersuchungen zur Schallstruktur der 
stimmhaften und stimmlosen Geräuschlaute’, Zeitschrift für Phonetik 7, 
No. 1/2, 1953, 89–104. 
measured. Not very much mapping of this kind** has been done 
since the investigator easily drowns in a sea of details. The major 
difficulty is to recognize one and the same pattern detail in the 
spectra from various voices. A formant may for instance be too weak 
to be observed or merge with adjacent formants. If the investiga- 
tor is not very well acquainted with possible pattern variations 
of the visible sound substance, he runs the risk of making errors 
in the labelling of data which will invalidate the statistics. It is 
safer to discuss in detail a few samples and wait with the statistics 
until a reliable specificational frame has been established and can 
be mastered. Meanwhile, the pattern aspects can be discussed and 
simple hypotheses**, 70 of how a phoneme recognizing machine 
should be instructed to perform an identification can be tested, 
for instance on the basis of experimental data from a few, very broad 
bandpass filters covering the speech spectrum as exemplified by 
the Mingograph curves of Fig. 5. 

The spectral characteristics of successive sound segments provide 
information both on the manner of production of a sound and on the 
place and configuration of the articulators within each interval. 
A complete separation of these factors is not always possible to 
obtain, and it is not very easy to infer the position of the articula- 
tors in the absence of a voice source, supplying the necessary energy 
to make F1, F2, F3, etc., visible on the spectrogram. The term 
F-pattern is adopted here for the composite data of the frequencies 
F1, F2, F3, etc. Each of these frequencies will alternatively be 
referred to as a formant position: F1-position, F2-position, etc. The 
F-pattern is apparently a fairly close physical correlate to the vocal 
tract configuration including both tongue and lip articulation. A 
transition is defined by the temporal variations of the F-pattern 
from one sound interval to the next. The term F2-position is 
identical with the Potter, Kopp, and Green ‘hub’, defined as the visible 
or hidden position of F2 within a consonant, but it is more general 
since it applies to any category of speech sound. The term ‘locus’ 
adopted by the Haskins Group has not been used by them in this 
strict meaning. 

However, the terminology F1-locus, F2-locus, etc. has been 
used by others in the same sense as the ‘hub’ but extended to 
include all formants of interest for the study of transitions. 


70 J. Wiren, H. L. Stubbs, ‘Electronic Binary Selection System for 
Phoneme Classification’, J. Acoust. Soc. Am. 28, 1956, 1082—1091. 
In a detailed study the F-pattern should be measured as a 
continuous time varying function within the whole utterance to be 
analysed. The F-pattern of the phase of maximum closure in a consonant 
is generally sufficient for deriving its transitional characteristics 
in combination with other known sounds. When the position of a 
formant is lower in a consonant than in a following vowel the tran- 
sition is said to be rising. The F-positions within an unvoiced sound 
must often be estimated by an extrapolation process. In any case, 
it should be stated in speech analysis whether data on ‘loci’ or 
F-positions refer to the F-pattern at the instant where voicing sets in, 
to an estimate of the F-pattern at the interval of maximum 
closure, or to a theory of speech perception including all physical 
stimuli in the speech wave. 

The transitional characteristics of Russian consonants are shown 
in Fig. 6. These diagrams were traced from spectrograms of open 
syllables composed of consonant plus the vowel [a]. The full extent 
of F, is shown for the voiced consonants. The other formants are 
traced within the vocalic interval only, starting from the visible trace 
after the articulators have started moving away from the closure. 

It can be seen that all consonants of the same tongue and lip 
articulation, for instance the palatalized labials, are associated with 
the same transitional patterns. The first part of the transition is 
mainly due to the opening of the lips. The rising F, and the fairly 
neutral F, transition of this delabialization phase reflects the F; 
mouth cavity, F, pharynx cavity dependency of palatal sounds 
discussed earlier. The next part of the F, transition back to a lower 
value is associated with the shift of the tongue back from the palatal 
position to the [a]-position. In the unpalatalized labial consonants 
the delabialization shows up in the rising F, and in the especially 
steeply rising F, transition. The tendency of F2 F3 proximity in 
the F-pattern of [k] and [g] before [a] is a typical effect. The high 
F2-position of dentals is seen in all instances except in the case of 
[na], where the assimilated nasality in the first part of the vowel 
makes F2 partly dependent on the nasal cavities. The higher 
starting point of the F1 transition after an unvoiced consonant 
compared to the corresponding voiced consonant is apparent. 

The pairwise presentation of transitions associated with voiced 
and unvoiced consonants also makes possible studies of how 
rapidly after explosion the articulators move away from the state 
of closure. The zero position of the time scale in each diagram 
corresponds to the instant of explosion or to the beginning of 
the opening process of the voiced member. This time reference 
is also indicated with a dotted arrow. The starting point for the 
unvoiced member is indicated with a solid line arrow. It can be 
seen that with the criterion of best fit of the superimposed formant 
patterns in the interval of time 0.1 to 0.2 seconds after the release, 
the starting points are the same for the labial stops, indicating 
equal speed of the articulatory movements. This is also true of 
[g] and [k]. In the unpalatalized dentals [d] and [t] there is apparently 
a more rapid transition for the unvoiced member, probably due to 
a faster jaw movement. The typical but redundant frication 
segment of the palatalized dentals occupies a longer time interval 
in [t,] than in [d,]. 

A method of graphical presentation of the essentials of spectro- 
graphic patterns including both F-patterns and the spectral compo- 
sition of consonants is shown in Fig. 7 pertaining to some major 
allophones of Swedish phonemes. 

In a more extensive presentation of data all possible positional 
variants, even with regard to stress, should be included. Consonants 
produced with the same lip and tongue articulation can of course 
be represented by a single F-pattern for each vowel to be considered. 

When attempting physiological interpretations of the spectro- 
graphic data the separate effects of a primary articulation and of 
co-articulation should be observed. These two determinants are 
partially overlaid. If the tongue position is optimally adjusted, the 
co-articulation of palato-velars is considerable, since both the front 
cavity volume and lip-rounding vary according to the associated 
vowel. The main formant of [k] or [g] can accordingly vary from 500 
c/s to 3500 c/s. Lip-rounding alone is responsible for a greater part 
of this variation. The common element is the single formant struc- 
ture and the neutral transition from this formant to the first part 
of the preceding sound after the articulatory release. Labial con- 
sonants, on the other hand, have a sharply rising transition in the 
first interval of delabialization. 

The F-patterns of labials vary of course with the particular 
position of the tongue, which may take any position not necessarily 
close to that of a following vowel. It can thus happen that the 
F2-locus of a labial consonant is higher than that of the stationary 
interval of the following vowel.* This is the case when the tongue 


* As for instance in Danish*? and in Swedish**. 
Fig. 7. Schematized assembly of the spectrographic characteristics of 
Swedish vowels and consonants. A mel scale approximation has been uti- 
lized for the frequency scale. Consonants are characterized by their spectral 
energy distribution and their F-patterns, vowels by F-patterns alone. 


lies in a flat neutral position in the labial consonant and the follow- 
ing vowel is [o] or [u]. The first part of the transition contains a ris- 
ing component due to delabialization and a falling component 
due to the tongue shift. These may add up in various ways. The 
effect of the rising component is generally restricted to the very 
first part of the transition but may be too weak to override the 
falling component. 

The fixed tongue position of dentals will condition a fixed F- 
pattern. Superimposed lip-rounding will be effective when the 
tongue has moved away so far from the dental contact that the 
dental passage is no longer small compared to the lip-opening. This 
instant in time may, however, coincide with the apparent starting 
point of the transition. Here the F2-position may be lowered due 
to the labialization. 

The spectral composition of the frication segment of a stop, 
fricative, or affricate is influenced by a superimposed lip-rounding, 
which shifts the centre of gravity of the spectral energy to a lower 
frequency and causes a more apparent concentration of the spectral 
energy. This effect does not, however, interfere with the relations 
between the phonemes of different places of articulation. 

A summary or critical review of the various theories, or rather 
the various formulations of the distinctive features of consonants, 
proposed by different research groups will not be attempted here. 
Readers are referred to Eli Fischer-Jørgensen’s review,1 some recent 
original articles, and to other discussions to come. 


3. New Instruments for Speech Analysis and Synthesis71 


3.1. The 48-Channel Spectrograph 


The Sonagraph, if correctly handled and in good shape, is an 
excellent tool for displaying the time-frequency-intensity compo- 
sition of speech. When performing analysis of large quantities of 
speech there is, however, a need for an instrument of higher analys- 
ing capacity that provides reasonably good pictures at low cost 
and with short processing time. A spectrograph that fulfils these 
requirements has been designed by H. Sund72 at the R.I.T. in 
Stockholm. This is a direct display instrument intended for the recording 
of time-frequency-intensity spectrograms on continuously moving 


71 An extensive technical review of various spectrum analysis methods 
is given in W. Meyer-Eppler, ‘Die Spektralanalyse der Sprache’, Zeitschrift 
für Phonetik 4, 1950, 241–252 and 328–364. 

72 H. Sund, ‘A Sound Spectrometer for Speech Analysis’, Transactions 
of the R. I. T., No. 112, Stockholm 1957. More recent contributions to the 
development of recording techniques have been made by A. Risberg. 
35 millimeter film and for the alternative or simultaneous 
ing of a succession of frequency-intensity ‘sections’ frame after 
frame on 16 or 35 millimeter film. The spectrograph proper is 
composed of 48 high quality bandpass filters which are scanned by 
an electronic switch at a rate of 500 times per second. The intensity 
or rather signal amplitude in each filter is represented by the height 
of a corresponding vertical line on the screen of a cathode ray tube 
where the separate filters are ordered in a horizontal row. This is 
the intensity versus frequency display providing the ‘sections’. 
On a separate cathode ray tube the filter channels are ordered as 
points in a vertical row, and the intensity of the light beam at each 
point is modulated by the signal amplitude of the corresponding 
filter. The time axis has to be supplied by the continuously moving 
film of a camera attached to the oscilloscope. 

A film speed of 5 cm/s is utilized. A special time mark signal 
recurring with intervals of 1/5 second is displayed on the top of 
the spectrographic picture. All filters up to a centre frequency 
of 4000 c/s except the first one have a bandwidth of 300 c/s, and 
successively broader filters are used in the higher frequency region. 
Thus filter No. 48 covers the frequency region of 9400-10000 c/s, 
and filter No. 1 the interval 0-200 c/s. Up to 1000 c/s there is an 
overlap by a factor of 4, which means that there is a distance of 
75 c/s between the centre frequencies of adjacent filters. Between 
1000-2000 c/s the filters are arranged with 3 times overlap, i.e. 
with 100 c/s frequency intervals. Up to 3600 c/s this interval 
is 150 c/s. Above 4000 c/s there is no overlap. By this arrangement 
the centre frequencies of the filters are approximately distributed 
on the mel scale simulating the frequency to place conversion of 
the auditory mechanism. There are more filters (and thus more 
space on the spectrograms) devoted to the low and medium fre- 
quency region than to the higher frequency region. This is an 
advantage compared to the Sonagraph which has a linear frequency 
display. 
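
The relation between filter bandwidth, overlap factor, and centre frequency spacing quoted above can be checked with a short calculation. The sketch below assumes that the spacing is simply the bandwidth divided by the overlap factor; the overlap factor of 2 for the 150 c/s spacing is inferred from that assumption and is not stated in the text, and the function name is hypothetical.

```python
def centre_spacing_hz(bandwidth_hz: float, overlap_factor: float) -> float:
    """Distance between adjacent centre frequencies for overlapping filters."""
    return bandwidth_hz / overlap_factor

BANDWIDTH_HZ = 300.0  # bandwidth of the filters below 4000 c/s (except No. 1)

# (frequency region, overlap factor); the last factor is inferred, not quoted.
regions = (
    ("up to 1000 c/s", 4),
    ("1000-2000 c/s", 3),
    ("2000-3600 c/s", 2),
)

for region, overlap in regions:
    spacing = centre_spacing_hz(BANDWIDTH_HZ, overlap)
    print(f"{region:>15}: overlap x{overlap} gives {spacing:.0f} c/s spacing")
```

The three spacings come out as 75, 100, and 150 c/s, in agreement with the figures given for the instrument.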

Two sentences of American speech have been analysed by both 
the Sonagraph, Fig. 8, and by the 48-channel analyser, Fig. 9. The 
frequency resolution is somewhat better on the Sonagraph, except 
in the F1-region, but the aspect ratio of frequency to time is better 
in the pictures taken with the 48-channel recorder. Horizontal lines 
normally traversing a picture in pauses and weak parts reflect the 
vertical ordering of the separate filter channels and thus constitute 
Fig. 8. Sonagraph broad band spectrographic display of the sentences 
‘Which police first caught the wolf champing rotten zebra bait near my 
goathouse? Yes, judges do treasure very thin soiled T-shirts.” Text and 
transcription H. M. Truby. Above sample contains all the phonemes of 
GA in their primary positions plus some other alternate positionings. 
a frequency calibration. This pattern is contained within the 
initial exposure due to the automatic volume control that drives up 
the threshold marking level in the weak intervals. Every fifth 
channel has been given a slightly higher marking threshold. 

With the present camera it is possible to expose 10 minutes of 
speech at a time, but other cameras with larger reels may take up 
to 30 minutes of speech at a time. The development takes of course 
a longer time, and larger quantities of film should be sent to a photo- 
graphic laboratory for routine processing. The recordings are gener- 
ally made on a paper base material providing a white back- 
ground and black patterns. Material to be published is recorded 
on ordinary transparent base film. 

This new spectrograph is the only direct display instrument 
known to give permanent records of a quality comparable to the 
sonagrams. It is valuable for large scale investigations.73 

A succession of amplitude versus frequency sections covering 
the first one second of the utterance of Fig. 9 is shown in Fig. 10. 
These pictures were taken on the ‘amplitude scope’ with a 16 mm 
film camera operating at 64 frames a second. The frequency cali- 
bration is shown on the last picture of the series. Each section 
is a sample of approximately 10 milliseconds’ duration. The time 
calibration has to be based on a topographical comparison of 
sections and the spectrogram. Gating techniques can also be used 
for taking a single picture at a continuously variable instant 
of time. 

The amplitude scale has a 35 dB useful range, the upper part 
of which is logarithmic and the lower part linear. This is a fair 
compromise for spectrograms. A larger degree of compression would 
have resulted in a lack of contrast within the spectrum. When am- 
plitude sections are taken it may be necessary to perform the analysis 
twice with two different gain settings, one for weak sounds and one 
for more intense sounds. 

By halving the speed of the tape recorder (dividing all frequen- 
cies by a factor of 2) the apparent bandwidths of the filters will be 


73 An interesting technique for visible speech recording on a large 
closed loop 35 mm film based on the use of repeated playback over a wave 
analyser and employing lightwidth modulation of spectral energy has been 
described by Edgardh. This instrument does not seem to have come to much 
use yet. B. H. Edgardh, ‘Der Tonfilmspektrograph’, IVA 22, No. 5, Stock- 
holm 1951. 
doubled. This is of some value for avoiding the appearance of 
harmonic lines within the broad band spectra of high-pitched female 
voices. Similarly by increasing the speed of the tape playback 8 
times, the apparent bandwidths of the filters will be reduced by a 
factor of 8 to approximately 40 c/s, and the step between adjacent 
filters in the lower part of the spectrum is accordingly reduced from 
75 c/s to approximately 10 c/s. This provides a sufficiently good 
frequency resolution for fundamental pitch measurements, but 
only the first formant of the spectrum will be contained within 
the picture. Other compromises between frequency resolution and the 
effective upper frequency of analysis might be useful. 
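
The effect of the playback speed on the analysis can be summarized numerically. In the sketch below the filters are fixed, and every quantity is referred back to the frequency scale of the original speech by dividing by the speed factor. The 300 c/s bandwidth, the 75 c/s step, and the 10000 c/s upper limit are taken from the description of the 48-channel instrument; the function name is hypothetical.

```python
def referred_to_speech_hz(instrument_hz: float, speed_factor: float) -> float:
    """Value of a fixed instrument frequency on the original speech scale.

    speed_factor is playback speed divided by recording speed.
    """
    return instrument_hz / speed_factor

FILTER_BANDWIDTH_HZ = 300.0
LOW_BAND_STEP_HZ = 75.0
UPPER_LIMIT_HZ = 10000.0

for speed in (0.5, 1.0, 8.0):
    bandwidth = referred_to_speech_hz(FILTER_BANDWIDTH_HZ, speed)
    step = referred_to_speech_hz(LOW_BAND_STEP_HZ, speed)
    limit = referred_to_speech_hz(UPPER_LIMIT_HZ, speed)
    print(f"speed x{speed:<4}: bandwidth {bandwidth:6.1f} c/s, "
          f"step {step:5.1f} c/s, upper limit {limit:7.0f} c/s")
```

At 8 times speed the apparent bandwidth is 37.5 c/s and the step 9.4 c/s, matching the rounded figures of 40 and 10 c/s, while the upper limit falls to 1250 c/s, which is why little more than the first formant remains within the picture.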


3.2 The Mingograph. Accessories for Oscillographic Recordings 
3.21 Waveform Analysis 

The Mingograph74 is a direct writing 4-channel oscillographic 
recorder supplied with AC and DC amplifiers for two of the 
channels. It has a workable frequency range of 0-800 c/s, which is better 
than for most instruments of the direct writing type. The paper 
speed is variable in the following steps: 100-50-20-10-5-2-1-0.5 
cm/s. The two higher speeds require an additional motor drive. 
Very cheap recording paper can be used and no development process 
is involved. The traces compare favourably with what can be 
obtained using any other oscillograph. A wider frequency range 
of recording than up to 800 c/s can be obtained by the above 
mentioned technique of reducing the playback speed of a tape 
recorder. A frequency division, i.e. time expansion, by a factor 
of 8 should be undertaken for recording oscillograms with a 
maximum effective paper speed of 8 m/s and an effective frequency 
response of 6400 c/s. A substantial bass reduction will result in 
this operation if no special compensation of the tape recorder is 
made. On the other hand a certain amount of attenuation of the 
lowest frequency region is generally desirable for phonetic purposes. 

74 Produced by AB Elema, Stockholm, Sweden. A direct writing ink 
recorder of a higher upper frequency limit than the Mingograph but not 
capable of producing equally fine traces has been designed by Lacerda: 
Göran Hammarström, ‘Le chromographe et le triangle tonométrique de 
Lacerda’, Revista do Laboratório de Fonética Experimental de Coimbra I, 
1952, 28–38. 

Fig. 11. Waveform analysis of the fundamental pitch and of the 
frequency F1 and bandwidth B1 of the first formant in the fourth period of 
the vowel [9] in [do]. Mingograph high speed oscillogram, half speed tape 
recorder playback. 

Waveform analysis of a fundamental period in the syllable 
[da] is exemplified in Fig. 11. Both the frequency of the first 
formant F1 and its bandwidth B1 and the frequency of the voice 
fundamental F0 can be measured as indicated in Fig. 11. A formant 
in a frequency display of a sound always corresponds to a damped 
sine wave in the oscillogram. The period time of this oscillation, 
covering a positive and negative excursion of the curve, is the 
inverse of the formant frequency, just as the duration of a voice 
period is the inverse of the fundamental pitch. The decay of the 
envelope drawn to enclose smoothly the peaks of the damped oscil- 
lation is inversely related to the bandwidth of the formant. A rapid 
decay of the oscillation means a large bandwidth. The relative 
envelope decay, expressed as the ratio of the amplitudes A1 and A2 
measured from the positive to the negative envelope at two 
arbitrary instants of time T12 seconds apart, defines the bandwidth 

$$B_1 = \frac{\log_{10}(A_1/A_2)}{\pi\, T_{12}\, \log_{10} e} \qquad (3)$$


A requirement is that only one formant influences the 
measurements. A certain amount of prefiltering is generally necessary, 
especially in back vowels where F2 comes close to F1 and is of 
comparable intensity. 
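
The bandwidth measurement from the envelope decay can be tried out on a synthetic example. The sketch below assumes a single resonance whose envelope decays as exp(-πBt), B being the bandwidth in c/s, so that B equals log10(A1/A2) divided by π T12 log10 e. The function names and the numerical values are chosen for illustration only.

```python
import math

def bandwidth_from_decay(a1: float, a2: float, t12_s: float) -> float:
    """Bandwidth in c/s from two envelope amplitudes taken t12_s seconds apart."""
    return math.log10(a1 / a2) / (math.pi * t12_s * math.log10(math.e))

def envelope(t_s: float, bandwidth_hz: float, a0: float = 1.0) -> float:
    """Envelope of a damped oscillation from a resonance of given bandwidth."""
    return a0 * math.exp(-math.pi * bandwidth_hz * t_s)

def frequency_from_period(period_s: float) -> float:
    """Formant (or fundamental) frequency as the inverse of the period time."""
    return 1.0 / period_s

TRUE_BANDWIDTH_HZ = 60.0          # illustrative value
t1, t2 = 0.001, 0.004             # two arbitrary instants within one voice period
a1 = envelope(t1, TRUE_BANDWIDTH_HZ)
a2 = envelope(t2, TRUE_BANDWIDTH_HZ)

print(f"estimated bandwidth: {bandwidth_from_decay(a1, a2, t2 - t1):.1f} c/s")
print(f"period 2 ms corresponds to {frequency_from_period(0.002):.0f} c/s")
```

Since only the ratio of the two amplitudes enters, the absolute calibration of the oscillogram does not matter for the bandwidth estimate.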

The Mingograph in itself is merely an oscillograph designed for 
general laboratory use and not particularly for speech research. 
It is, however, a very useful tool for phonetic research if equipped 
with accessories for automatic extraction of voice fundamental, 
formant frequencies, intensity, and several other parameters repre- 
senting important aspects of the signal structure. Such accessories 
have been constructed or are under development at the Speech 
Transmission Laboratory of the Royal Institute of Technology 
in Stockholm but are not commercially available. Because of the 
very low costs for the recording material and the immediate display 
of several synchronous parameters, it is probable that these tech- 
niques will change the general attitude towards phonetic mass inves- 
tigation, especially with regard to the study of prosodic categories 
such as sentence stress and intonation and word accent or any 
other phonetic category that is related to duration, intensity, and 
fundamental pitch of the speech wave. These techniques will also 
provide efficient means for storing quantitative data on the spectral 
distribution of speech energy as a supplement to the frequency- 
intensity-time Visible Speech spectrograms obtainable with the 48- 
channel recorder. 

Data requiring a lifetime of work to process with classical tech- 
niques can now be handled within a week. Perhaps, it is more ade- 
quate to state that the data processed by these techniques in a 
period of a week can keep a phonetician busy for a lifetime of work 
if he cares to make maximal use of the data. It then becomes a 
serious problem to know how far to develop the techniques 
before mass production is started. A technical improvement of 
a detail in the recording procedure that can facilitate the practical 
interpretation of the data may mean a substantial saving of time. 
This is one of the reasons why these techniques have not been made 
much use of yet in any larger project. They are still under 
development. 


3.22 Automatic Pitch Recording Devices 

One of the most valuable instrumental developments that can 
be used in conjunction with an oscillograph is the automatic pitch 
extractor designed by Grützmacher and Lottermoser75, 76 twenty 


75 M. Grützmacher, W. Lottermoser, ‘Über ein Verfahren zur 
trägheitsfreien Aufzeichnung von Melodiekurven’, Akustische Zeitschrift, 1937, 
242–248. 

76 W. Kallenbach, ‘Eine Weiterentwicklung des Tonhöhenschreibers 
mit Anwendungen bei phonetischen Untersuchungen’, Akustische Beihefte, 
1951, Heft 1. 
years ago. The reason it has not come into extensive use is that the 
combination of oscillograph and pitch extractor is expensive and 
difficult to handle. It works well with some voices77 and not very 
well with others78 and needs optimal adjustment for every special 
type of voice. Noise or power line hum superimposed on a recording 
may seriously distort the pitch curve. Some of these difficulties 
have been removed during recent years’ engineering developments 
of analysis-synthesis-telephony systems in which F0-extracting 
devices are of crucial importance. A hundred per cent performance 
has not yet been obtained, but most of the present variants of the 
original device may be used if the investigator is aware of the 
tendencies of misbehaviour and can correct the curves accordingly. 
In combination with the Mingograph recorder the Grützmacher 
method becomes easier to handle, since no photographic 
development is involved and since the paper cost is of no concern. 


3.23 Automatic Extraction of Formant Frequencies 


The communication engineering research into 
analysis-synthesis-telephony systems has resulted in some designs79 
for automatic extraction of voltages proportional to the frequencies 
of the first three formants. Such devices are not yet very reliable 
in operation. It is hard to build into the instrument the judgement 
of a phonetically experienced observer, identifying and labelling 
the formants in correct order and deciding whether an energy 
maximum in the spectrum represents a single formant or a formant 
group. 

One simple type of formant detector is a frequency counter that 
measures the time length of a period of formant oscillation, similar 
to the operation of the F0-detector. Bandpass prefiltering is 
required. This is merely a method of mechanizing the procedure 
of making direct observation on the waveform of an oscillogram. 
An additional advantage is that a lower paper speed can be used 
for the recording. 
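
The principle of such a period counter can be sketched in a few lines. The example below is not a description of any of the devices cited; it merely assumes that the prefiltered signal contains a single damped formant oscillation, and it times the upward zero crossings of a sampled version of that signal. The sampling rate, the test signal, and the function names are hypothetical.

```python
import math

def upward_zero_crossings(samples, sample_rate_hz):
    """Times in seconds of upward zero crossings, linearly interpolated."""
    times = []
    for i in range(1, len(samples)):
        before, after = samples[i - 1], samples[i]
        if before < 0.0 <= after:
            fraction = -before / (after - before)
            times.append((i - 1 + fraction) / sample_rate_hz)
    return times

def mean_frequency_hz(samples, sample_rate_hz):
    """Mean oscillation frequency from the spacing of upward zero crossings."""
    times = upward_zero_crossings(samples, sample_rate_hz)
    if len(times) < 2:
        raise ValueError("need at least two zero crossings")
    return (len(times) - 1) / (times[-1] - times[0])

# Test signal: one damped formant oscillation at 700 c/s, bandwidth 80 c/s.
RATE_HZ = 20000
signal = [
    math.exp(-math.pi * 80.0 * n / RATE_HZ)
    * math.sin(2.0 * math.pi * 700.0 * n / RATE_HZ + 0.3)
    for n in range(400)
]

print(f"measured formant frequency: {mean_frequency_hz(signal, RATE_HZ):.1f} c/s")
```

If a second formant of comparable intensity passes the prefilter, the zero crossings no longer follow a single oscillation, which is one reason why bandpass prefiltering is required.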


77 See the tone curve on pages 70–74 of L. Hegedüs, ‘Neue Methoden 
in der Erforschung der Diphtonge’, Zeitschrift für Phonetik 9, 1956, No. 1, 
31–74. 

78 See the tone curve on page 130 of W. Stüben, ‘Poesie und Prosa’, 
Zeitschrift für Phonetik 7, 1953, No. 1/2, 128–136. 

79 J. L. Flanagan, ‘Automatic Extraction of Formant Frequencies 
from Continuous Speech’, J. Acoust. Soc. Am. 28, 1956, 110–117. 
These methods might become useful for phonetic research as 
a supplement to time-frequency-intensity spectrograms. A spec- 
trographic display will, however, always constitute the most im- 
portant acoustic reference in speech analysis. 


3.24 Sweep Frequency Analysis of Sustained Sounds 


Several acoustic investigations into the spectral composition 
of sustainable sounds have been performed in the period both before 
and after the development of Visible Speech apparatus.1” 1% 23°24 81, 82 
In some of the earlier studies!” the subject had to sing the vowel 
to be analysed for a time of several minutes during which the centre 
frequency of the analysing filter was slowly shifted from zero 
frequency to the upper limit of the analysis, thus tracing the spec- 
trum harmonic after harmonic. The special instrument used for the 
‘Suchtonanalyse’ by these investigators!’ was not very efficient. 
The technical aspects of various methods have been explained in 
detail by Meyer-Eppler.80 The speech analysis performed by the 
author at the Ericsson Telephone Company in 1945-1948 (refs. 23, 24) was 
mainly based on a sweep frequency technique. The subjects had 
to sustain the voiced sounds for a period of 4 seconds during which 
an analysis from 0-4000 c/s was performed with a 50 c/s analysing 
filter. Recently this technique of frequency analysis has been 
adopted at the Royal Institute of Technology for making quick 
studies of sound qualities. The analysis time has been shortened 
to 3 seconds for a 31 c/s wide filter and to 3/4 seconds for a 62 c/s 
filter. The limiting rule for the sweep time is that the centre frequency 
should be changed less than B c/s in the time 1/B where B is the 
bandwidth of the analysing filter. The Mingograph is ideal as a 
recorder for sweep frequency analysis because of its relatively 
large recording bandwidth and thus rapid response to signal changes. 
This is the quickest available method at present of obtaining a high 
quality permanent record of a harmonic spectrum of a sound. One 
must simply hold the sound for a period slightly longer than in a 


80 W. Meyer-Eppler, ‘Die Schwingungsanalyse nach dem 
Suchton-Verfahren’, Archiv der Elektr. Übertragung 4, 1950, 331–338. 

81 Y. Ochiai, T. Fukumura, ‘Timbre Study of Vocal Voices’, Memoirs 
of the Faculty of Engineering, Nagoya University, vol. 5, 1953, 253–280. 

82 Y. Ochiai, T. Fukumura, ‘Beiträge zur Erkenntnis der 
Klangfarbenstruktur bei vokalischen Klangbildern’, Memoirs of the Faculty of 
Engineering, Nagoya University, vol. 8, 1956, 1–10. 
stressed syllable and then tear off a piece of paper from the 
Mingograph showing the detailed spectrum. Some samples analysed with 
a 32 c/s wide filter are shown in Fig. 12. The harmonic structure 
comes out very clearly. A greater amplitude range and a more 
detailed resolution can be obtained by this method than by making 
sections with the Sonagraph, as can be seen from a comparison 
with Fig. 1. No high frequency pre-emphasis was utilized in the 
recording of Fig. 12. The top diagram of the assembly shows the 
spectrum of the synthetic neutral vowel [3] standardized in the 
laboratory as having formant frequencies at odd integer multiples of 
500 c/s, formant bandwidths of 100 c/s, a pitch of 100 c/s, and a voice 
source of -12 dB/octave in the analog production. These 
characteristics are retained up to a frequency of 3000 c/s. Curve No. 2 
from the top is the spectrum of a sound produced by the author 
with an effort to simulate the neutral vowel. As can be seen the 
main shape is very similar but F, and F, have come a little too 
close. Also the fourth formant is of higher intensity. Next the spectra 
of [a u i] are shown. Observe the ‘single’ formant structure of the 
Swedish [u] pronounced with very small lip-opening and narrow 
tongue pass. The higher formants are very weak in [u] but apparent 
in [i]. Observe that F4 of [i] has the highest level of the formants 
within the upper formant group. 
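
The harmonic spectrum of the standardized synthetic neutral vowel can be approximated from the data quoted above. The sketch below is a simplified model and not the laboratory's analog: it assumes three simple resonances at 500, 1500, and 2500 c/s with 100 c/s bandwidths, a source falling by 12 dB per octave, and harmonics at multiples of 100 c/s up to 3000 c/s. All names are hypothetical.

```python
import math

FORMANTS_HZ = (500.0, 1500.0, 2500.0)  # odd multiples of 500 c/s below 3000 c/s
BANDWIDTH_HZ = 100.0
PITCH_HZ = 100.0
SOURCE_SLOPE_DB_PER_OCTAVE = -12.0

def resonance_gain(f_hz: float, formant_hz: float, bandwidth_hz: float) -> float:
    """Magnitude response of one conjugate pole pair, equal to 1 at zero frequency."""
    pole = complex(-math.pi * bandwidth_hz, 2.0 * math.pi * formant_hz)
    s = complex(0.0, 2.0 * math.pi * f_hz)
    return abs(pole) ** 2 / (abs(s - pole) * abs(s - pole.conjugate()))

def harmonic_level_db(f_hz: float) -> float:
    """Level of a harmonic at f_hz relative to the source level at the pitch."""
    level = SOURCE_SLOPE_DB_PER_OCTAVE * math.log2(f_hz / PITCH_HZ)
    for formant in FORMANTS_HZ:
        level += 20.0 * math.log10(resonance_gain(f_hz, formant, BANDWIDTH_HZ))
    return level

harmonics = [PITCH_HZ * k for k in range(1, 31)]
levels = [harmonic_level_db(f) for f in harmonics]

# Harmonics that stand above both neighbours mark the formant peaks.
peaks = [
    harmonics[i]
    for i in range(1, len(levels) - 1)
    if levels[i] > levels[i - 1] and levels[i] > levels[i + 1]
]
print(peaks)
```

In this model the only harmonics exceeding both neighbours are those at the three formant frequencies, which corresponds to the regular appearance of the top diagram.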


3.25 Bandpass Intensity-Time Recordings 

The sections of stops and fricatives shown in Fig. 5 were ob- 
tained from an assembly of data from 35 separate oscillograms of 
speech intensity versus time within the frequency band of a 150 c/s 
wide filter of a wave analyser set to a different mid frequency for 
each oscillogram. 

Recently a special wave analyser instrumentation has been 
taken into use at our laboratory employing a set of 6 filters each 
of bandwidth variable in steps from 32 to 500 c/s and arranged 
with an overlap by a factor 2. The rectified and smoothed outputs 
from each of the 6 filters can be recorded simultaneously on the 
Mingograph. This is accomplished by means of a multiplex system⁸³
for recording 2 signal functions on each of 3 of the 4 Mingograph
channels. The multiplex system consists of an electronic switch
that alternately connects each of the two signal sources, e.g.


83 The multiplex unit has been designed by A. Moller, who has also
participated in the development of other Mingograph accessories.
Fig. 12. Sweep frequency analysis of the synthetic
neutral vowel [ə] and some sustained vowels, subject
G.F. A 31 c/s analysing filter centred at 1000 c/s
and 2 steps of modulation were utilized. Sweep
speed 4000 c/s in 3 seconds. Mingographic recording.
No high frequency pre-emphasis. If a 62 c/s filter
is utilized the time needed for analysis will be shortened
to 3/4 seconds.
two of the bandpass filter units, to the recording channel. The
switching is synchronized by a timing signal from a separate track 
of the tape recorder on which the speech material has been stored. 
The switching is performed with intervals of 10 milliseconds. This 
provides a sufficiently closely defined time scale permitting the 
synchronously sampled data from the various curves to be assem- 
bled. When a high accuracy is to be maintained only one filter is 
coupled to a single recorder channel and the synchronized switching 
is merely used as a time mark. In this way it is possible to store the 
intensity versus time data in a large number of more or less narrow 
filter bands for future reference in the analysis. This technique 
originates from the analysis of Swedish stop sounds in 1948,⁷⁴ which
in turn was inspired by the octave band oscillographic studies per- 
formed by Trendelenburg.'® 
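
The switching principle can be illustrated by a small sketch in which two smoothed filter outputs share one recorder channel and are exchanged every 10 milliseconds. The function names, the sampled representation, and the sample rate in the example are assumptions made for illustration only; the actual unit is an analog electronic switch.

```python
def multiplex(trace_a, trace_b, sample_rate_hz, switch_interval_s=0.010):
    """Alternate between two equally long traces every switch interval."""
    block = max(1, round(sample_rate_hz * switch_interval_s))
    channel = []
    for n, (a, b) in enumerate(zip(trace_a, trace_b)):
        source = (n // block) % 2  # 0 selects trace_a, 1 selects trace_b
        channel.append((source, a if source == 0 else b))
    return channel


def demultiplex(channel):
    """Recover (sample index, value) pairs for each of the two sources."""
    from_a = [(n, v) for n, (src, v) in enumerate(channel) if src == 0]
    from_b = [(n, v) for n, (src, v) in enumerate(channel) if src == 1]
    return from_a, from_b


# Hypothetical example: two constant traces sampled at 1000 samples per second.
channel = multiplex([1.0] * 40, [2.0] * 40, sample_rate_hz=1000)
a_points, b_points = demultiplex(channel)
```

Each source is then known only in alternate 10 millisecond intervals, which is why a single filter per channel is preferred when high accuracy is required.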

The only disadvantage of this method is that it may take some 
time to identify and order the synchronous intervals from separate 
oscillograms. If many successive sections are to be sampled this
is no objection, and it may be advisable to pre-edit the tape
recording so that it contains a maximum of segments to
be analysed. One advantage inherent in the time marking technique
is that the time location of a sample is well known. The range
of intensity levels that can be measured is limited by the spectral 
level of noise in the initial recording only and is thus higher than 
what can be obtained by means of Sonagraph sectioning. 

A narrow band ‘sectioning’ by means of this assembly technique 
requires a large number of oscillograms to be taken. If the multiplex
device is used, enabling 6 oscillograms to be recorded
simultaneously on the 10 cm wide Mingograph paper, the
speech material has to be played back 25 times to produce a spectrum
sampling up to 3000 c/s with 20 c/s frequency intervals between
adjacent frequency bands.
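
The count of 25 playbacks follows from simple arithmetic: 3000 c/s covered in steps of 20 c/s gives 150 bands, and 6 traces are recorded per pass. The sketch below only restates that calculation; the function and parameter names are illustrative.

```python
import math


def playbacks_needed(upper_limit_cps, band_spacing_cps, traces_per_pass=6):
    """Number of tape playbacks needed to cover the range with all bands."""
    bands = math.ceil(upper_limit_cps / band_spacing_cps)
    return math.ceil(bands / traces_per_pass)


print(playbacks_needed(3000, 20))  # 25
```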

If the major purpose of the multi-bandpass recording is to obtain
data on the intensity levels of formants that have been identified
and measured with regard to frequency in the 48-channel spectrographic
display, it is sufficient to make a mingographic recording
of the outputs from a rather small number of broad bandpass filters,
perhaps 10-20 bands with a frequency width of the order
of 500 c/s. This analysis requires the speech to be run through
the analysing and recording instrumentation only 2-4 times. It
may be convenient to supplement the analysis with frequency
counters that give a running measure of the centre frequency of
speech signals within some of the lower bands. This can be made
a routine procedure for processing all speech material collected in
a phonetic investigation. Another routine recording of greater
immediate importance to the phonetician is the simultaneous display
of the fundamental pitch F0 and speech intensity as time
varying quantities.


3.26 Intensity Measurements


Intensity is synonymous with power and is of the dimension 
energy per unit time. If a speech sound has constant intensity, i.e. 
builds up immediately and stays at a constant level until the sound 
is cut off, there exists the ideal condition for expressing the energy 
of the sound as the product of its intensity and duration. This ideal 
condition is seldom approached in the speech wave. First of all it 
should be acknowledged that the intensity variations within a 
fundamental voice period are of no significance in this connection. 
Intensity is therefore expressed as an average value per voice 
period or as an average over any small unit time interval of a dura- 
tion comparable to the voice period. The term mean speech power 
introduced by H. Fletcher! is thus defined as an average over a 
time of 10 milliseconds. 
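
As an illustration of these definitions, the sketch below computes the mean power of a sampled wave in successive 10 millisecond intervals and the energy of a segment. It assumes a digitized signal in arbitrary units, which is of course not how the measurements described here were made; all names are illustrative.

```python
def mean_power(samples, sample_rate_hz, window_s=0.010):
    """Average of the squared signal in successive windows (arbitrary units)."""
    n = max(1, round(sample_rate_hz * window_s))
    frames = [samples[i:i + n] for i in range(0, len(samples), n)]
    return [sum(x * x for x in frame) / len(frame) for frame in frames]


def energy(samples, sample_rate_hz):
    """Energy of the segment: squared signal summed over time."""
    return sum(x * x for x in samples) / sample_rate_hz


# A sound of constant intensity lasting 0.1 s: energy = intensity x duration.
steady = [2.0] * 1000
powers = mean_power(steady, sample_rate_hz=10000)
print(powers[0] * 0.1, energy(steady, sample_rate_hz=10000))
```

For the steady sound both printed values agree, which is the ideal condition referred to above; for real speech the energy must be taken as the area under the time-varying intensity curve.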

An intensity meter is a device that produces an electrical volt- 
age which represents but is not necessarily proportional to the 
intensity of the speech wave. This voltage must be recorded by 
an oscillograph of some sort, e.g. the Mingograph. The energy of 
any segment of the speech wave, e.g. a syllable, is apparently pro- 
portional to the area under the intensity curve within the time 
interval under consideration. This presupposes that the recorded voltage
is strictly proportional to the intensity, which is generally not the
case. The technical process called rectification in the intensity 
meter generally produces a voltage proportional to the square 
root of the intensity. This is a so-called linear rectification and the 
ideal performance providing true intensity measure is called square 
law rectification. Since both instruments can be calibrated to give 
the same readings for a sine wave and since the error involved when 
measuring speech is a couple of dB at the most, it is considered 
sufficient to use the linear rectification system. The area under 
the intensity curve originating from an intensity meter with linear 
rectification is thus not energy in a strict sense, although it is of
the same conceptual dimension. The term ‘impulse area’ is suggested
for this area measure to be used as a physical correlate to
stress.
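
The difference between the two kinds of rectification can be made concrete with a short numerical sketch. It assumes an idealized full wave linear rectifier followed by averaging, calibrated so that a sine wave reads its r.m.s. value, and compares it with a true square law reading. The test signals and names are illustrative assumptions; no measured speech data are involved.

```python
import math

# Calibration factor that makes an averaged, rectified sine read its r.m.s. value.
SINE_CALIBRATION = math.pi / (2.0 * math.sqrt(2.0))


def linear_reading(samples):
    """Averaged full wave linear rectification, sine-calibrated."""
    return SINE_CALIBRATION * sum(abs(x) for x in samples) / len(samples)


def square_law_reading(samples):
    """Root of the mean square, i.e. the true intensity measure."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def linear_error_db(samples):
    """Level difference between the linear and the square law reading."""
    return 20.0 * math.log10(linear_reading(samples) / square_law_reading(samples))


sine = [math.sin(2.0 * math.pi * k / 100) for k in range(1000)]
square = [1.0 if (k // 50) % 2 == 0 else -1.0 for k in range(1000)]
print(linear_error_db(sine), linear_error_db(square))
```

The sine wave gives practically no error, while the square wave, an extreme waveform, reads a little less than 1 dB too high. This is in keeping with the statement that the error for speech is a couple of dB at the most.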

An intensity meter contains, or can be composed of the following 
successive units: pre-emphasis filter, rectifier, smoothing filter,
amplitude compression unit. The role of the pre-emphasis filter, 
if any, is to adjust the relative weight of the contributions from 
lower and higher parts of the speech spectrum in conformity with 
the frequency dependent sensitivity of the auditory system. 

The same kind of pre-emphasis filters as contained in a standard 
sound level meter for noise measurements may be used. Three 
alternative filter settings are generally included, of which the third 
provides no filtering at all and is intended for measurements of 
sounds of a level higher than 85 dB. This is the C-curve. The B-curve
provides a small amount of bass reduction, 6 dB at 100 c/s,
and is intended for sound levels of 55-85 dB. The A-curve finally 
is used at sound levels below 55 dB. It provides as much as 22 dB 
attenuation at 100 c/s relative to the reference level at a frequency 
of 2500 c/s, where the ear is maximally sensitive. There is also 
some attenuation of the highest part of the spectrum, above 3000 c/s. 
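
For orientation, the weighting curves can be evaluated numerically. The sketch below uses the analytic A- and B-weighting expressions of present-day sound level meter standards, referred to 1000 c/s. This is an assumption made for illustration: the curves of the meters in use when this report was written, and the 2500 c/s reference used above for the A-curve, may differ somewhat from these formulas.

```python
import math


def a_weighting_db(f):
    """Modern analytic A-weighting in dB relative to 1000 c/s (assumed curve)."""
    f2 = f * f
    ra = (12194.0 ** 2 * f2 * f2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.00


def b_weighting_db(f):
    """Modern analytic B-weighting in dB relative to 1000 c/s (assumed curve)."""
    f2 = f * f
    rb = (12194.0 ** 2 * f2 * f) / (
        (f2 + 20.6 ** 2) * math.sqrt(f2 + 158.5 ** 2) * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(rb) + 0.17


for name, curve in (("A", a_weighting_db), ("B", b_weighting_db)):
    print(name, round(curve(100.0), 1), "dB at 100 c/s")
```

With these curves the attenuation at 100 c/s comes to about 19 dB for the A-curve and about 6 dB for the B-curve, both relative to 1000 c/s, which is of the same order as the figures quoted above.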

In normal conversation at a distance of one meter the sound 
pressure level of speech is of the order of 65 dB. The B-curve should 
therefore be representative for measuring the level of vowel sounds. 
Unvoiced consonants, on the other hand, are about 20 dB weaker 
than the vowels and it would therefore be more appropriate to 
use the A-curve for measuring them. The intensity relations within 
the class of unvoiced consonants are not radically affected by the 
particular choice of pre-emphasis filter. Greater differences are observed
within the class of voiced sounds, since the level of the first formant
varies much less than the levels of higher formants and is physically
more intense. The second and higher formants do not influence
measurements much except when the bass is reduced by means
of a B-curve, and even more so by means of an A-curve
pre-emphasis.

The rectifier characteristics have already been discussed. The 
function of the rectifier unit is to convert the alternating current 
speech signals into a direct current. A full wave rectifier should 

84 This measure seems to be equivalent to ‘Gesamtlautstärke’ as defined
by Maack. A. Maack, ‘Höchstlautstärke und Durchschnittslautstärke’,
Zeitschrift für Phonetik 7, 1953, No. 3/4, 213–230.
be used in order that both the parts above and below the zero
line of the speech wave should be made use of. In a half wave recti- 
fier only the positive or the negative part of the speech wave is 
utilized. Since the speech wave generally is asymmetric especially 
for deep male voices, and since the rectifier characteristics are 
seldom strictly linear, there exists the possibility that a phase re- 
versal of the connecting cords from the tape recorder to the inten- 
sity meter will result in slightly different intensity values. This 
effect can be found when using the amplitude display unit of the 
Sonagraph; but it is not large enough to cause any considerable 
ambiguities. 

The smoothing filter is an averaging device intended for the 
removal of short time fluctuations of speech intensity. The averaging
time or integration time of this lowpass filter is of the order
of the reciprocal value of twice the bandwidth, i.e. $T_i = 1/(2B)$,
where $B$ is the cutoff frequency of the filter and $T_i$ the integration
time.
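
The relation between integration time and cutoff frequency, read here as T = 1/(2B), can be checked against the cutoff frequencies mentioned in this section. The function name is illustrative.

```python
def integration_time_s(cutoff_cps):
    """Integration time of the smoothing filter, T = 1 / (2 B)."""
    return 1.0 / (2.0 * cutoff_cps)


for cutoff in (50.0, 10.0, 2.5):
    print(cutoff, "c/s ->", 1000.0 * integration_time_s(cutoff), "ms")
```

A 50 c/s cutoff corresponds to 10 milliseconds, 10 c/s to 50 milliseconds, and 2.5 c/s to 200 milliseconds.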

The standard value of the cutoff frequency of the smoothing 
filter used in most speech intensity measurements at the R.I.T. in 
Stockholm is 50 c/s. The integration time then comes close to the 
10 milliseconds value recommended by Fletcher. The integration time 
of the Sonagraph amplitude display is also of the order of 10 milliseconds.
Intensity curves of deep male voices will show a superimposed
ripple which is a residue of the voice fundamental. It can
be made useful for measuring the voice fundamental frequency
F0 by the classical means of converting period length T0 to frequency,
$F_0 = 1/T_0$, for instance by the mechanical graphical method developed
by E. A. Meyer.® This instrument has the advantage of providing
a logarithmic frequency display, i.e. a constant number of
semitones per mm. There is of course no harm in using a smoothing
filter of a higher cutoff frequency than 50 c/s. When F0 is high
this is necessary if the voice ripple is to be seen on the intensity 
curves. It has also the advantage that the intensity variations 
within a stop sound can be studied more accurately. It is thus 
convenient to have a stepwise variable cutoff frequency in the 
smoothing filter. 
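
The conversion from period length to frequency and to a logarithmic (semitone) scale can be sketched as follows. The 100 c/s reference and the function names are arbitrary choices for illustration; they are not taken from the instrument described.

```python
import math


def f0_from_period(period_s):
    """Fundamental frequency in c/s from the measured period length."""
    return 1.0 / period_s


def semitones_above(f0_cps, reference_cps=100.0):
    """Interval in semitones between f0 and an arbitrary reference."""
    return 12.0 * math.log2(f0_cps / reference_cps)


print(f0_from_period(0.008), semitones_above(f0_from_period(0.005)))
```

On a semitone scale equal musical intervals occupy equal distances, so that an octave is 12 semitones regardless of the absolute pitch of the voice.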

An amplitude compression unit is needed for a logarithmic 
display of intensity. The logarithmic character of an intensity curve 
implies a calibration with a constant number of decibels per millimeter
of the curve. It is not necessary to have a strictly logarithmic
curve, but some degree of compression is useful for extending
the measurable amplitude range. A compressed scale is linear 
for small signal amplitudes and logarithmic for larger amplitudes. 
Vowels generally differ comparatively little in intensity even when 
different contexts with regard to stress are taken into account. 
For studies of accents and stress it is therefore possible to obtain 
more sensitive measures if a linear display is utilized. Further 
the measure of the area under the intensity curve in a syllable is 
well defined only if a strictly linear display is utilized as required 
in the definition of impulse area. When only intensity measures 
are of interest it may be preferable to have a compressed amplitude 
scale, especially when both weak consonants and more intense 
vowels are to be measured from the same recording. It is accordingly 
desirable to have means for an alternatively linear or compressed 
amplitude scale in the intensity meter. 
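
A compression characteristic of the kind described, linear for small amplitudes and logarithmic for larger ones, might be sketched as below. The particular law and the position of the knee are hypothetical; the report does not specify the characteristic of the actual unit.

```python
import math


def compress(amplitude, knee=1.0):
    """Linear below the knee, logarithmic above it (hypothetical law).

    Value and slope are both continuous at the knee.
    """
    if amplitude <= knee:
        return amplitude
    return knee * (1.0 + math.log(amplitude / knee))


for a in (0.1, 0.5, 1.0, 10.0, 100.0):
    print(a, round(compress(a), 2))
```

A range of amplitudes of 1000 to 1 is thereby squeezed into a deflection range of roughly 56 to 1, so that weak consonants and intense vowels can both be read from the same trace.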

Design criteria based on hearing⁸⁵ ⁸⁶ would require a logarithmic
intensity scale, but this is of no concern if the measured intensity 
values are expressed in decibels. A decibel calibration is therefore 
utilized even for a linear amplitude display. The smoothing filter 
should have a bandwidth of the order of 10 c/s only for simulating 
an auditory integration time of 50 milliseconds presumably valid 
for short bursts of white noise. A 50 millisecond ‘smear time’, 
adopting the terminology of Joos,‘ is probably a reasonably signif- 
icant average value for the auditory time constant. A cutoff fre- 
quency as low as 2.5 c/s in the smoothing lowpass filter is required 
if the loudness perception of sine waves is to be simulated. This 
corresponds to a time constant of 200 milliseconds. However, 
speech is neither bursts of white noise nor pulses of sine waves, 
and the inertia effects in the auditory system are more complex 
than indicated above. The effect of short time auditory fatigue 
decreasing the sensitivity of the ear for a very short time after the 
offset of a sound of high intensity should also be taken into account. 
The magnitude of this effect in speech perception is not very well 
known, and the masking effect of low frequency high intensity for- 
mants on weaker formants in a higher frequency range has not been 
much studied either. There exist neither instruments nor formulas 
for reasonably accurate calculations of loudness of complex time

85 The psychophysics of hearing can be studied in books like S. S.
Stevens, H. Davis, Hearing, New York 1938, 1947, and 
86 S.S. Stevens, Handbook of Experimental Psychology, New York 1951. 
variable stimuli like speech. As a rule of thumb for the significance
of the decibel scale it can be mentioned that the smallest audible 
intensity differences are of the order of 0.5—1 dB and that an 
increase in level by 10 dB is subjectively appreciated as a doubling 
of loudness.⁸⁷ It is of only limited interest to smear out the intensity
variations of connected speech by smoothing filters that simulate
the large inertia of the auditory system. It is preferable
to use the 50 c/s smoothing filter as a standard so that both the 
crest values within a syllable and its impulse area can be measured. 
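
The rule of thumb that 10 dB corresponds to a doubling of loudness can be written as a one-line conversion. It is only the approximation quoted above, not a loudness calculation for speech; the function name is illustrative.

```python
def loudness_ratio(level_difference_db):
    """Approximate loudness ratio, assuming a doubling for every 10 dB."""
    return 2.0 ** (level_difference_db / 10.0)


print(loudness_ratio(10.0), loudness_ratio(-20.0))  # 2.0 0.25
```

On this rule a consonant 20 dB weaker than a vowel would be judged about one quarter as loud.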

The intensity meter described above has not been in use long 
enough for the practical phonetic significance of various combi- 
nations to be clear. It is thus not a ready technique but rather 
suggestions for further experiments that are offered here. One 
general observation worth noting is that the various instruments 
and measurements now available lead to not very different results 
on a purely relational basis. Even the oscillogram itself displaying 
the original waveform of the speech wave can be used as a sub- 
stitute for the intensity curves when evaluating vowels and voiced 
sounds in general. The maximum amplitudes and the average 
amplitudes of an intensity curve have a close correlation as shown 
by Maack.⁸⁴

When evaluating results from intensity measurements of vowels, 
it is advisable to relate intensity data to their average values for
each particular sound if the data are to reflect the relative voice
effort in the production. We do not possess a reliable empirical
basis for stating to what degree the feel of stress when emphasizing
a specific part of an utterance is correlated to relative intensity
data and to what extent sonority data alone suffice as stress correlates.⁸⁸
Any speech wave data may, however, be related directly
to the linguistic structure, that is, without the support of a psychological
interpretation. The tendency towards lengthening is the
most obvious feature observed as a physical correlate to stress in
conformity with observations on synthetic speech.⁸⁹ There is also
the clearer vowel-consonant contrast as previously discussed and
a higher intensity and fundamental tone level.


87 S. S. Stevens, ‘Calculation of the Loudness of Complex Noise’, J.
Acoust. Soc. Am. 28, 1956, 807–832.

88 A. C. Gimson, ‘The Linguistic Relevance of Stress in English’, Zeitschrift
für Phonetik 9, 1956, No. 2, 143–149.

89 D. B. Fry, ‘Duration and Intensity as Physical Correlates to Stress’,
J. Acoust. Soc. Am. 27, 1955, 765–768.
In this connection it should be pointed out that the ‘mouth
line’ on a kymograph of the classical type into which the subject 
speaks directly through a tube is not an intensity measure. The 
superimposed voice ripple, utilized for pitch measurements, is the 
same as that in the modern intensity meter described above, but 
the mean value of the curve is merely a measure of the varying 
static air pressure in the mouth cavity during articulatory open 
intervals. 

When interpreting Mingograph recordings it becomes a problem 
to identify the time intervals occupied by the various sounds of 
the utterance. This segmentation process is best carried out with the 
aid of supplementary broad band spectrograms. Ordinarily, a simul- 
taneous recording is made of intensity, tone curve, and oscillogram. 
These data provide a more effective basis for the segmentation 
than an oscillogram alone. Since the Mingograph is limited to the 
0-800 c/s frequency range, it is apparent that unvoiced sounds 
occupying a higher frequency region cannot be detected in the os- 
cillogram. Dental stops and fricatives are especially useful topo- 
graphical references for mapping the sequential segments of speech. 
A rectification process, as utilized for producing the intensity 
curve, is needed for their portrayal. One technical trick for making 
the high frequency sounds appear clearer in an oscillogram is to 
make separate use of positive and negative signals. The negative 
part of the oscillogram can be replaced by a highpass filtered 
rectified function of the speech wave. A highpass filter or rather 
a frequency correction network providing increasingly larger attenu- 
ation at frequencies below 4000 c/s has been used with some suc- 
cess. Negative signals on this ‘duplex oscillogram’ will indicate pre- 
sence of appreciable sound energy in the high frequency region. 
The voiced-voiceless distinction shows up very clearly as presence 
versus absence of low frequency periodic oscillations in the negative 
part of the oscillogram. 
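
The idea of the duplex oscillogram can be sketched on sampled data as follows. The first difference used here is only a crude stand-in for the frequency correction network, and the whole function is an illustrative assumption rather than a description of the actual circuit.

```python
def duplex_oscillogram(samples):
    """Positive half of the wave upward, rectified high-frequency part downward."""
    trace = []
    previous = samples[0] if samples else 0.0
    for x in samples:
        high = x - previous  # first difference: crude high-frequency emphasis
        previous = x
        trace.append(max(x, 0.0) - abs(high))
    return trace


slow = [1.0] * 8                    # no high-frequency content
fast = [1.0, -1.0] * 4              # energy at the highest frequency only
print(duplex_oscillogram(slow))
print(duplex_oscillogram(fast))
```

The slowly varying signal stays on the positive side of the zero line, while the rapidly alternating one is driven to the negative side, which is how high frequency energy becomes visible on a recorder of limited bandwidth.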


3.27 Acoustic Display of Prosodic Features 


A. Juncture. A few examples of composite Mingograph record- 
ings of intensity, duplex oscillographic curve, and fundamental 
pitch are shown in Figs. 13, 14 and 15. In all these figures the top curve
intensity is measured with an A-curve pre-emphasis and a 50 c/s 
smoothing filter. The second curve from the top represents intensity 
measured with the B-curve pre-emphasis, 50 c/s smoothing filter, 
and a linear amplitude display. The third curve is the duplex oscillogram
and the fourth is the fundamental pitch curve produced by
a modified Grützmacher method. In addition a Sonagraph spectrogram
of the utterance is shown below the pitch curve in Figs.
14 and 15, together with the Sonagraph amplitude display
curve. A technical novelty in the form of an automatic zero fre- 
quency line calibration of the sonagram has been introduced, as 
can be seen from the narrow zero line on all the spectrograms. These 
have been produced with expanded frequency scale. 

Fig. 13 illustrates some juncture phenomena. To the left the 
sentences, ‘I said Cubanize not Cuban eyes’ and to the right, ‘Have 
you seen the meat?’ and ‘Have you seen them eat ?’ A glottal stop 
denotes the word boundary between ‘Cuban’ and ‘eyes’. The final 
position of the [m] in ‘them’ is signalled by its shorter duration 
compared to the longer initial [m] of ‘meat’, and this difference is
complemented by the shorter vowel length in ‘meat’ compared to 
‘eat’. There is also a greater consonant-vowel intensity contrast 
in the [mi] of ‘meat’.* 


B. Swedish Word Accent. The Swedish word accent separating 
words with accent 1, ‘anden ‘the duck’, and those with accent 2, ‘anden 
‘the ghost’, is exemplified in Fig. 14. A narrow band spectrogram 
and a Sonagraph amplitude display is also shown here for compari- 
son. The one versus two-peak tone curve characteristic® for the 
single tone accent 1 versus the compound tone accent 2 in Stockholm 
pronunciation is apparent. The narrow band sonagram displays 
the pitch variations quite efficiently, and the amplitude display 
curve has a fair resemblance to the intensity curve on the top of 
the picture. The major exception is that the Sonagraph amplitude 
display is more influenced by the first formant since it lacks the 
bass cut of the A-curve pre-emphasis utilized for the top curve.
This is why the vowel-consonant contrast is more apparent in the 
top curve. 

According to Malmberg* the tone curve is the only significant 
acoustic factor involved in the accent 1 — accent 2 distinction. 
His synthesis experiments*®® are, however, based on southern Swe- 
dish,” where the tone pattern is almost the reverse of that in the 


* Examples and signification from Truby.

90 B. Malmberg, Sydsvensk ordaccent, Lund 1953.

91 B. Malmberg, ‘Nyare fonetiska rön och deras praktiska betydelse’,
Nordisk Tidskrift för Dövstumsundervisningen 2, 1957, 53–93.
Stockholm dialect. The second syllable of the accent 2 word has
the same peak intensity as the second syllable of the accent 1 word 
but a larger impulse area due to the greater duration of the more 
intense parts, and thus a larger area under the intensity curve. 
The more prominent second syllable of accent 2 compared to accent 
1 conforms to the lexical stress transcription of 3-1 versus 4-0. 
In all other respects the intensity data fail to reflect the stress
notation. The first syllables of both words have about the same 
intensity level and impulse area. It is also evident that the large 
ratio of stress of the first syllable to the second syllable implied by 
the transcription does not have any correspondence in the intensity 
data. The relative prominence of the second syllable of accent 2 
is probably not the major intensity characteristic. The auditory 
impression of larger separation between the syllables in accent 2 
even called two-syllable accent, referred to by Gjerdman⁹² in a
discussion of the possible importance of the intensity cue, has a 
correspondence in the overall shape of the intensity curve which 
reflects a displacement of the intensity maxima of the first and 
second vowel further to the beginning and to the end of the word 
respectively. In the simple tone accent on the other hand the inten- 
sity is more concentrated in the middle of the word. The intensity 
variations thus largely reflect the tone variations. This is not a 
coincidence, since an increase in pitch at constant voice effort 
causes an increase in intensity. The physical explanation of this 
effect is that the voice source emits a larger number of equal energy 
pulses per second at the higher pitch. The intensity variations 
are thus largely conditioned by the tone variations. A similar close 
correlation between intensity and tone curve has also been found 
by Malmberg⁹¹ for southern Swedish pronunciation, thus supporting
his theory that intensity does not play an independent role for the 
Swedish word accent. Further investigations are needed, however, 
to reach a deeper insight into these questions, especially with regard 
to different dialects. 
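
The physical explanation given above, that the voice source emits more pulses of equal energy per second at the higher pitch, implies that the power is proportional to the fundamental frequency. The resulting level change can be sketched as below. This follows only from the equal-energy assumption stated in the text and ignores any other effect of pitch on the source; the function name is illustrative.

```python
import math


def level_change_db(f0_from_cps, f0_to_cps):
    """Level change when equal-energy voice pulses are emitted at a new rate."""
    return 10.0 * math.log10(f0_to_cps / f0_from_cps)


print(round(level_change_db(100.0, 200.0), 1))  # 3.0 dB for a rise of one octave
```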


C. Varying Sentence Stress. A variation of sentence stress is 
exemplified in Fig. 15. The sentence I morgon flyger jag till Stock- 
holm was pronounced 5 times with the main emphasis successively 
shifted from the second word to the last word. The two intensity 
curves on the top are the same as in the previous figures, i. e. 


92 O. Gjerdman, ‘Accent 1 och accent 2, akut och gravis’, Nysvenska
Studier XXXII, 1954, 125—154. 
A-curve frequency pre-emphasis, compressed amplitude scale and
B-curve pre-emphasis linear amplitude scale. The transcription is 
given between the duplex oscillogram and the pitch curve below 
which follows the Sonagraph amplitude display and a broad band 
spectrogram. The three different intensity curves are included for 
instrumental comparison only. They do not differ much in general 
appearance, but it can for instance be seen that the [s]-sounds
are relatively more apparent in the top curve and that the second 
curve displays the syllabic structure best. 

The sentence stress shows up as both a higher fundamental 
pitch, higher intensity, and greater duration of the syllable that 
carries the main emphasis. There are also fewer omissions and assimilations
in the stressed positions. All these factors are consistently
found. The pitch increase is generally seen as a relatively unselec- 
tive rise in the average tone level in the neighbourhood of the 
emphasized word. Observe how the stress on the final word Stock- 
holm affects the second syllable more than its first syllable. This is 
also the case with the other accent 2 word of this sentence, viz.
morgon. The second syllable absorbs the greater part of the sentence 
stress and becomes physically more intense than the first syllable. 

The increased articulatory precision which always accompanies 
the stress can be studied in detail in a display of the type offered 
in Fig. 15, where all the physical evidence from the speech wave 
is displayed. The spectrogram is very helpful in this respect. The 
word till is pronounced as [til] in the stressed position only and
otherwise as [ta]. The impulse area of this syllable increases in in- 
verse proportion to the distance from the stressed words. The [r] 
in morgon shows 3 complete rolled periods in the emphasized version, 
sentence No. 1, and otherwise there are one or two flaps only. The 
voiced palatal stop [g] is produced with incomplete closure phase 
in the third and the fifth sentence as can be seen from the unbroken 
F2 and F3 traces.* This is a typical instance of reduced consonant- 
vowel contrast in unstressed positions, basically due to an incomplete 
closure in the median part of the mouth. The vowel [e] which should 
follow [g] is heard rather indistinctly, since it is absorbed by the 
first part of the glide [ja], except in the case of the third sentence 
where it is associated with a separate pulse in the intensity curve. 
There is in no place any acoustic trace of a consonant [r] in flyger.

None of these observations have any considerable linguistic 

* Observe the F2 F3 proximity typical for palatals. 
novelty, but they are mentioned to show the methods of portrayal
and the capabilities of the instruments. The new techniques of 
speech analysis are acoustic in nature, but they can give the linguist 
a better insight into what has been said on a special occasion and 
also how the speech has been produced. This information can also 
be gained at a considerably reduced price with regard to recording 
expenses and the time needed for the analysis. 


3.3. Instrumentation for Speech Synthesis 


3.31 The Redundancy Problem from an Engineering Point of View 


The communication engineering interest in speech research is 
generally concentrated on the theory and design of analysis-synthe- 
sis-telephony systems. These devices will presumably enable the 
engineer to transmit perhaps 30 simultaneous telephone calls over 
a line originally intended for one call only, thus reducing the neces- 
sary bandwidth for transmission of speech to something of the order 
of 200 c/s. Alternatively, the engineers count on obtaining better 
quality on radio links that are disturbed by a very high noise level 
or other distortions. 

The means for achieving this would be to extract at the trans- 
mitting end a large part of the redundant detail structure from the 
speech wave and to transmit only a few relatively slowly varying
signals containing the invariants, the ‘information bearing elements 
of speech’. At the receiver a synthetic speech is remade which may
resemble the original speech closely, remotely, or not at all. The transmitted
signals have the function of controlling the synthesis process, but 
they do not enter the final product. The general acoustic capabili- 
ties of the speaking machine can thus be regarded as a prestored 
redundancy restoring the body of the speech wave that was re- 
moved at the transmitting end. 

Redundancy is here defined with regard to a specific communication
criterion, that of producing either natural speech that sounds
similar to the original or stereotyped but intelligible speech.

Speech synthesis has attracted a lot of interest and attention, 
but it has proved to be an easier task to produce high quality synthetic
speech than to perform a mechanized analysis of the control
signals needed for a continuous and simultaneous synthesis. It is 
apparent that any advance in the theory of acoustic specification 
of speech for a descriptive purpose can have practical applications 


in the technique of analysis-synthesis-telephony. This is one of the
major selling points in the engineers’ speech research. Analysis- 
synthesis-telephony is related to ordinary telephony as art to pho- 
tography. A redundancy reduction is essential in both instances. 


3.32 Some Instruments for Synthesis 


A. The Pattern Playback. The art of producing synthetic speech 
by spectrographic pattern playback, which by now has made 
Haskins Laboratories famous, is based on an initial representational 
stage of painted stylized formant patterns with white paint on 
a plastic base belt. A spectrum of harmonically related sine waves 
of light produced by a tone wheel is projected on the moving plastic 
belt in a frequency versus place order that conforms to the fre- 
quency calibration of the spectrogram. Those harmonics that are 
reflected by the painted lines are collected in a photocell, ampli- 
fied, and passed on to a sound recording and reproduction system 
which permits an immediate auditory check of the painted utter- 
ance. This synthetically produced speech is completely monotone 
due to the fine structure of harmonics to a fundamental pitch of 
120 c/s. By random interruptions of the painted pattern it is possible 
to simulate unvoiced sounds. The Haskins playback machine has 
provided a wealth of empirical knowledge of the significance of 
various spectrographic pattern aspects.?” 
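
The principle of the pattern playback can be imitated numerically: only those harmonics of the 120 c/s fundamental that fall on a painted part of the pattern contribute to the output. In the sketch below the painted pattern at one instant is given as a list of frequency bands; the band limits, the number of harmonics, and all names are assumptions for illustration and do not describe the Haskins machine in detail.

```python
import math

FUNDAMENTAL_CPS = 120.0  # fixed pitch of the tone wheel, as stated in the text


def reflected_harmonics(painted_bands, n_harmonics=50):
    """Harmonics falling inside any painted band (n_harmonics is assumed)."""
    harmonics = [k * FUNDAMENTAL_CPS for k in range(1, n_harmonics + 1)]
    return [f for f in harmonics
            if any(lo <= f <= hi for lo, hi in painted_bands)]


def playback_sample(t, painted_bands, n_harmonics=50):
    """Output at time t: the sum of the reflected harmonics, all of equal weight."""
    return sum(math.sin(2.0 * math.pi * f * t)
               for f in reflected_harmonics(painted_bands, n_harmonics))


# Hypothetical two-formant pattern painted at 400-600 c/s and 1400-1600 c/s.
pattern = [(400.0, 600.0), (1400.0, 1600.0)]
print(reflected_harmonics(pattern))  # [480.0, 600.0, 1440.0, 1560.0]
```

Because every component is a multiple of 120 c/s whatever is painted, the output has the same pitch throughout, which is the monotone quality mentioned above.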


B. The Vocoder. The Haskins Laboratories are also pioneers in 
using a vocoder for research* in linguistic problems. The vocoder® 9% %
is an analysis-synthesis system in which the data concerning
the spectral distribution of speech energy is represented


93 A. M. Liberman, P. Delattre, F. S. Cooper, ‘The Role of Selected
Stimulus Variables in the Perception of the Unvoiced Stop Consonants’,
The Am. J. Psych. LXV, 1952, 497–516.

94 P. Delattre, A. M. Liberman, F. S. Cooper, ‘Speech Synthesis as a
Research Technique’, Proc. of the VI Int. Cong. of Linguists, London 1956.

95 A. M. Liberman, ‘Some Results of Research on Speech Perception’,
J. Acoust. Soc. Am. 29, 1957, 117–123.

* The Bell Telephone Laboratories have earlier made experiments on
spectrographic pattern playback with the aid of a voder equipment, i.e.
essentially the receiving part of a vocoder, see ref. 96.

96 H. W. Dudley, ‘Fundamentals of Speech Synthesis’, Bell Telephone
System Monograph 2648, 1956; see also L. O. Schott.

97 J. M. Borst, F. S. Cooper, ‘Speech Research Devices, Based on a
Channel Vocoder’, 53rd Meeting of the Acoust. Soc. Am., Paper M4, 1957. See
by a number of voltages describing the rectified and smoothed
intensity variations within each of a number of bandpass filters
covering the speech frequency range. These signals, produced at
the transmitting end, control the synthesis at the receiving end
by a modulation process in each of a number of bandpass channels
that correspond to the bandpass filters at the transmitter. The
receiver is capable of producing either voiced or unvoiced sounds
or both simultaneously. The frequency of the voice fundamental
constitutes one important signal parameter to be transmitted. It is
possible, with the appropriate accessories, to reproduce speech with
a fairly natural quality of the sounds but with a synthetically
introduced pitch inflection curve. Studies of the differential role
of the fundamental pitch can thus be carried out.* Systematic
intensity and quality changes can also be made.*
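The analysis and synthesis chain just described can be sketched in a few
lines of Python. This is a minimal illustration under assumptions that are
not taken from any particular vocoder: the bandpass filters are idealized
by zeroing FFT bins, the smoothing is a simple moving average, and the
band edges, function names and smoothing time are hypothetical.

```python
import numpy as np


def bandpass(x, lo_hz, hi_hz, sample_rate):
    """Idealized bandpass filter: zero all FFT bins outside [lo_hz, hi_hz)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    spectrum[(freqs < lo_hz) | (freqs >= hi_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))


def envelope(x, sample_rate, smooth_s=0.02):
    """Rectify the band signal and smooth it with a moving average."""
    width = max(1, int(smooth_s * sample_rate))
    kernel = np.ones(width) / width
    return np.convolve(np.abs(x), kernel, mode="same")


def analyse(speech, band_edges_hz, sample_rate):
    """Transmitter: one slowly varying control signal per bandpass channel."""
    return [envelope(bandpass(speech, lo, hi, sample_rate), sample_rate)
            for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:])]


def pulse_train(n, f0_hz, sample_rate):
    """Voiced excitation: one unit pulse per period of the fundamental.

    f0_hz may be a scalar or an array of length n (a pitch inflection curve).
    """
    step = np.broadcast_to(np.asarray(f0_hz, dtype=float) / sample_rate, (n,))
    cycles = np.floor(np.cumsum(step))
    pulses = np.zeros(n)
    pulses[1:] = np.diff(cycles) > 0
    return pulses


def synthesise(envelopes, band_edges_hz, excitation, sample_rate):
    """Receiver: modulate each band of the excitation by its control signal."""
    out = np.zeros(len(excitation))
    for env, lo, hi in zip(envelopes, band_edges_hz[:-1], band_edges_hz[1:]):
        out += env * bandpass(excitation, lo, hi, sample_rate)
    return out
```

Passing a different f0_hz curve to pulse_train while keeping the envelopes
from the analysis corresponds to the synthetically introduced pitch
inflection mentioned above; an unvoiced channel could be imitated by using
random noise as the excitation instead.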

There are many different types of speech synthesizers in
operation at places like the Bell Telephone Laboratories,** the Haskins
Laboratories,97 the Acoustic Laboratory of Massachusetts Institute
of Technology,100, 102 the North Eastern University in Boston,
the Ministry of Supply in England,103 the University of Edinburgh,104
and the British Post Office. Most of these synthesizers have
some features in common. There are three main categories:


also other papers presented at this session by members of the Haskins 
Laboratories. 

98 B. Malmberg, ‘Observations on Swedish Word Accent’, Haskins 
Laboratories Research Reports, 1955. 

99 F. Vilbig, K. H. Haase, ‘Some Systems for Speech-band
Compression’, J. Acoust. Soc. Am. 28, 1956, 573—577.

100 J. L. Flanagan, A. S. House, ‘Development and Testing of a
Formant-Coding Speech Compression System’, J. Acoust. Soc. Am. 28, 1956,
1099—1106.

101 S.-H. Chang, ‘Two Schemes for Speech-band Compression’, J. Acoust.
Soc. Am. 28, 1956, 565—572; see also C. R. Howard, ‘Speech Analysis-
Synthesis Scheme Using Continuous Parameters’, J. Acoust. Soc. Am. 28,
1956, 1091—1098.

102 G. Rosen, K. N. Stevens, J. M. Heinz, ‘Dynamic Analog of the Vocal
Tract’, J. Acoust. Soc. Am. 28, 1956, 767 (A).

103 W. Lawrence, ‘The Synthesis of Speech from Signals which have a
Low Information Rate’, in Communication Theory, ed. W. Jackson, London
1953.

104 A listener’s response to a synthetic test word undergoes phonemic 
shifts when the calibration of the formant pattern of the immediately pre- 
ceding introductory phrase is changed. P. Ladefoged, D. E. Broadbent, 
‘Information Conveyed by Vowels’, J. Acoust. Soc. Am. 29, 1957, 98—104. 
(1) Spectral playback machines like the Haskins PB2, described
above.

(2) Formant circuit synthesizers. This is the most common type 
and is found at all places above. 

(3) Configurative analogs, e.g. the Swedish LEA, and similar 
electrical analogs to the vocal tract at M.I.T. and B.T.L. 


The following presentation will concentrate on the speech
synthesizers employed in the speech research at the Royal Institute
of Technology in Stockholm.105


C. Formant Circuit Synthesizers. The Swedish OVE I} belongs
to the class of formant circuit synthesizers comprising a number
of resonance circuits, one for each formant to be represented.
The formant circuits can be arranged either in series as in OVE,
or in one version of POVO,!® or in parallel as in the English
machines and in a B.T.L. device.107 The latter system is more flexible
when both vowels and consonants are to be produced with a minimum
of circuitry but does not so easily provide the same naturalness of
vowels.
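The series arrangement can be illustrated with a short Python sketch. It
is a digital stand-in under stated assumptions, not a model of the OVE
circuitry: each resonance circuit is replaced by a standard two-pole
digital resonator, the voice source is a bare pulse train, and the
formant frequencies, bandwidths and function names are hypothetical
values chosen for the example.

```python
import numpy as np


def resonator(x, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator standing in for one formant circuit."""
    period = 1.0 / sample_rate
    c = -np.exp(-2.0 * np.pi * bandwidth_hz * period)
    b = 2.0 * np.exp(-np.pi * bandwidth_hz * period) * np.cos(2.0 * np.pi * freq_hz * period)
    a = 1.0 - b - c  # gives unity gain at zero frequency
    y = np.zeros(len(x))
    y1 = y2 = 0.0
    for n, xn in enumerate(x):
        yn = a * xn + b * y1 + c * y2
        y[n] = yn
        y2, y1 = y1, yn
    return y


def cascade_vowel(formants_hz, bandwidths_hz, f0_hz=120.0,
                  duration_s=0.5, sample_rate=10000):
    """Feed a pulse train through the formant resonators connected in series."""
    n = int(duration_s * sample_rate)
    source = np.zeros(n)
    source[::int(round(sample_rate / f0_hz))] = 1.0  # one pulse per pitch period
    out = source
    for freq, bandwidth in zip(formants_hz, bandwidths_hz):
        out = resonator(out, freq, bandwidth, sample_rate)  # output feeds the next circuit
    return out


# A neutral vowel with assumed formants at 500, 1500 and 2500 c/s.
vowel = cascade_vowel([500, 1500, 2500], [60, 90, 120])
```

In the series connection the relative levels of the formants follow from
their frequencies and bandwidths alone. A parallel arrangement would sum
the resonator outputs instead and would need a separate gain for each
formant.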

The formant frequencies, the frequency of the voice fundamental,
and the onset of voice are manually controlled106 in OVE I, as
indicated schematically in Fig. 16. The angular displacements of
two rods, each attached to a potentiometer, determine the frequency
tuning of F1 and F2 and thus an acoustic vowel figure in which
any vowel sound may be placed. The position within the plane
of the manoeuvre board where the two rods meet in a joint
constitutes a reference for the calibration. The fundamental pitch F0,
as well as the on-off switching of the voice source and these two
formant frequencies, are all varied by means of a single one-hand
control. F3 can be controlled separately, and a special press button
pitch box can be used for the F0-control to give a stepwise variation
of the pitch as in song.

This instrument is especially well suited for demonstrating the 
dependency of vowel colour on formant frequencies. Diphthongs 


105 C. G. M. Fant, ‘Speech Communication Research’, IVA (Royal 
Swedish Academy of Engineering Sciences) 24, 1953, 331—337. 

106 M. Joos, ref. 4, p. 82, utilized a similar device for vowel synthesis.
Two formant frequencies and the on-off switch were included in a one-hand
control.

107 E. S. Weibel, ‘Vowel Synthesis by Means of Resonant Circuits’,
J. Acoust. Soc. Am. 27, 1955, 858—865.
Fig. 16. Simplified diagram illustrating the manual
operation of the Swedish speech synthesizer OVE I. A
fixed fourth formant is also included. The fundamental
pitch, the on-off switch of the voice source, and the
frequencies of the first and the second formants are all varied
in a one-hand control. The position of this control within
the plane of the manoeuvre board determines a point in
the acoustic vowel diagram of F1 versus F2.


can easily be made and short sentences containing the consonants
[w v j r l] can be simulated although there is an apparent lack
of quality of some of the consonants. English speech is best suited
because of the glides and the vocalic [r]. Sentences like ‘How are 
you?’ ‘Where are you?’ ‘I love you’ can be made natural enough 
to be recognized without any conditioning of the listeners. 

Since a continuous range of natural and acoustically well-defined 
vowels can be produced, there is some possibility that this instru- 
ment may become useful for standardizing vowel qualities and pho- 
netic symbols. In a simplified form it might also be used as a de- 
monstration tool in phonetic courses. 

A later development of this instrument, OVE II, is controlled 
by a photoelectric function generator that converts pre-drawn 
curves of the temporal variations of formant frequencies, funda- 
mental pitch and source character, and intensity into continuous 
Photoelectrically controlled speech synthesizer
Fig. 17. Simplified diagram illustrating the operation of OVE II. A
photoelectric function generator is employed for converting the data on
formant frequencies and other pre-determined parameters to the
appropriate control voltages. OVE II is intended for systematic studies
of the relative importance of various spectral parameters.


instructions to the particular units within the speaking machine. 
The principle is shown in Fig. 17. It is similar to the operation of 
the English machine designed by Lawrence. OVE II is intended
as a research tool for investigating the differential importance of
the main variables in speech. Not only distinctions of phonemic
character but also problems of the acoustic correlates to speech 
naturalness can be investigated. The machine will not be used for 
any larger project of this type before it has been developed further. 
A more stable function generator is also needed. 


D. Configurative Analogs. LEA is a member of a quite different 
class of speech synthesizers. The electrical source simulating the 
function of the vocal cords is of the same type as that feeding OVE I 
and II and the end result is essentially the same. LEA is composed of a
large number of coils and condensers that are related to the
configuration of the air chambers within the vocal tract as shown in
Fig. 18. There are 45 successive filter sections in LEA, each
composed of a series coil and a shunt condenser, each section
representing a 0.5 cm thick slice of the cavities cut perpendicular to the
direction of the air flow. The cross-sectional area of such a slice 
constitutes one point on the vocal tract area function describing 
Electrical line analog
Fig. 18. Scheme and photograph of the electrical line vocal tract analog
LEA. The cross-sectional area of the vocal cavities from the glottis to the 
lips constitutes the ‘area function’ in which all the articulatory information 
is contained. The area function can be visualized by the outline of the 
control knobs of the instrument as seen in the photograph. LEA can produce 
artificial speech sounds but is primarily used as an analog machine for com- 
puting the acoustic effect of pre-determined articulatory configurations. 
the area variations from the vocal cords to the lips. This is the
relevant articulatory information for predicting the acoustic behaviour
of the system. In LEA each elementary section can be given one
out of 16 area settings, representing cross-sectional areas ranging
from 0.16 to 16 cm². As can be seen from Fig. 18, the outline of the
area function may be visualized by the contour of the control knobs
of successive sections.
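How an area function determines formant frequencies in a line analog of
this kind can be sketched numerically. The Python example below is an
illustration under assumptions that are not taken from the LEA design:
each 0.5 cm slice of area A is given a series inductance rho*l/A and a
shunt capacitance A*l/(rho*c^2), following the usual acoustical-electrical
analogy, losses and the radiation load at the lips are neglected, and the
glottis is treated as an ideal volume-velocity source. The constants and
the function name ladder_resonances are hypothetical.

```python
import numpy as np

RHO = 1.14e-3       # density of air in g/cm^3 (assumed value)
C_SOUND = 35000.0   # velocity of sound in cm/s (assumed value)


def ladder_resonances(areas_cm2, section_cm=0.5, f_max=4000.0, step_hz=5.0):
    """Resonance frequencies of a lossless LC ladder built from an area function.

    areas_cm2 lists the cross-sectional areas from the glottis to the lips.
    """
    areas = np.asarray(areas_cm2, dtype=float)
    inductance = RHO * section_cm / areas                    # series coil per section
    capacitance = areas * section_cm / (RHO * C_SOUND ** 2)  # shunt condenser per section

    freqs = np.arange(step_hz, f_max, step_hz)
    w = 2.0 * np.pi * freqs
    # Chain matrix [[a, b], [c, d]] of the whole ladder, evaluated at every frequency.
    a = np.ones(len(w), dtype=complex)
    b = np.zeros(len(w), dtype=complex)
    c = np.zeros(len(w), dtype=complex)
    d = np.ones(len(w), dtype=complex)
    for ind, cap in zip(inductance, capacitance):
        z = 1j * w * ind
        y = 1j * w * cap
        # Right-multiply by the matrix of one section, [[1 + z*y, z], [y, 1]].
        a, b, c, d = a * (1 + z * y) + b * y, a * z + b, c * (1 + z * y) + d * y, c * z + d

    # With zero pressure at the lips, U_lips / U_glottis = 1 / d, so the
    # resonances are the zeros of d (which is real for a lossless ladder).
    d_real = d.real
    crossings = np.nonzero(d_real[:-1] * d_real[1:] < 0)[0]
    return [float(freqs[i] - d_real[i] * step_hz / (d_real[i + 1] - d_real[i]))
            for i in crossings]


# A uniform tube of 35 sections (17.5 cm) with an area of 6 cm^2 throughout.
formants = ladder_resonances([6.0] * 35)
```

For the uniform tube the first three values fall close to 500, 1500 and
2500 c/s, the odd quarter-wavelength resonances of a tube closed at one
end, with a small upward shift caused by the lumped approximation.
Narrowing some of the sections shifts the resonances, which is the kind
of computation that a configurative analog performs directly.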

LEA or similar electrical vocal tract devices at Massachusetts 
Institute of Technology* or the Bell Telephone Laboratories‘? are 
used either as analog computing machines for converting articu- 
latory data from X-ray pictures to corresponding data on the spec- 
tral characteristics of speech sounds in the form of formant fre- 
quencies or complete spectrum envelopes or merely as synthesizers 
for the production of artificial speech sounds. 

Unlike OVE and other direct formant generating synthesizers,
which are more easily manoeuvred, LEA and similar devices can
produce only stationary sounds. An exception is the dynamical
electrical vocal tract analog now under construction at M.I.T.,102
where the area settings are controlled by an electronic servo system.

The fundamental investigations of the relations between arti- 
culation and formant patterns performed by Stevens and House*® 4’ 
were based on calculations with the aid of their static configu- 
rative analog. Similar theoretical work is undertaken with the Swe- 
dish LEA as a contribution to the general understanding of the 
speech mechanism, especially of formant-cavity relations. 

Experimental investigations with LEA can substitute for 
greatly time-consuming numerical calculations of the acoustic 
significance of various articulatory details. 


SUMMARY 


The report summarizes techniques of studying speech by means 
of acoustic analysis and synthesis with special emphasis on recent 
developments at the Royal Institute of Technology, Stockholm, 
aiming at the processing of large quantities of speech at low cost. 

The basic relations between articulation and speech wave are 
discussed, and methods of spectrographic and oscillographic analysis 
and classification of the essential signal structure of speech sounds 
are exemplified. 




Acoustic correlates to vowel quality, stress, juncture, and word 
accent are discussed. Speech synthesis is described briefly with em- 
phasis on instrumentation developed in Sweden. 